1 Requirements

requirements=c("summarytools", "pROC", "glmnetUtils", "dplyr", "car", "effects", "gridExtra", "grid", "MASS","e1071", "mgcv")

for (req in requirements){
  if (!require(req, character.only = TRUE)){
      install.packages(req)
  }
}

2 Loading data

shootings <- readRDS("data/shootings_known.Rda")
print(dfSummary(shootings), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 day_period [factor]
1. EarlyMorning
2. Morning
3. EarlyAfternoon
4. Afternoon
5. Evening
6. Night
804(5.8%)
860(6.2%)
1429(10.3%)
2799(20.2%)
2872(20.7%)
5109(36.8%)
13873 (100.0%) 0 (0.0%)
2 day_year [numeric]
Mean (sd) : 187 (96.8)
min ≤ med ≤ max:
1 ≤ 188 ≤ 366
IQR (CV) : 150 (0.5)
366 distinct values 13873 (100.0%) 0 (0.0%)
3 year [numeric]
Mean (sd) : 2013.5 (5.2)
min ≤ med ≤ max:
2006 ≤ 2013 ≤ 2022
IQR (CV) : 9 (0)
17 distinct values 13873 (100.0%) 0 (0.0%)
4 week_day [factor]
1. Monday
2. Tuesday
3. Wednesday
4. Thursday
5. Friday
6. Saturday
7. Sunday
1902(13.7%)
1685(12.1%)
1635(11.8%)
1622(11.7%)
1889(13.6%)
2590(18.7%)
2550(18.4%)
13873 (100.0%) 0 (0.0%)
5 COVID_lockdown [factor]
1. FALSE
2. TRUE
13327(96.1%)
546(3.9%)
13873 (100.0%) 0 (0.0%)
6 COVID_pandemic [factor]
1. FALSE
2. TRUE
11096(80.0%)
2777(20.0%)
13873 (100.0%) 0 (0.0%)
7 working_hour [factor]
1. FALSE
2. TRUE
11262(81.2%)
2611(18.8%)
13873 (100.0%) 0 (0.0%)
8 Latitude [numeric]
Mean (sd) : 40.7 (0.1)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 40.9
IQR (CV) : 0.2 (0)
7051 distinct values 13873 (100.0%) 0 (0.0%)
9 Longitude [numeric]
Mean (sd) : -73.9 (0.1)
min ≤ med ≤ max:
-74.2 ≤ -73.9 ≤ -73.7
IQR (CV) : 0.1 (0)
7047 distinct values 13873 (100.0%) 0 (0.0%)
10 city_location [factor]
1. S_Manhattan
2. N_Manhattan
3. W_Bronx
4. E_Bronx
5. N_E_Queens
6. N_W_Queens
7. S_Queens
8. N_Brooklyn
9. S_E_Brooklyn
10. S_W_Brooklyn
[ 2 others ]
405(2.9%)
1626(11.7%)
3122(22.5%)
1182(8.5%)
392(2.8%)
493(3.6%)
1172(8.4%)
2105(15.2%)
2097(15.1%)
709(5.1%)
570(4.1%)
13873 (100.0%) 0 (0.0%)
11 perp_age [factor]
1. <18
2. 18-24
3. 25-44
4. 45+
1560(11.2%)
6101(44.0%)
5559(40.1%)
653(4.7%)
13873 (100.0%) 0 (0.0%)
12 vic_age [factor]
1. <18
2. 18-24
3. 25-44
4. 45+
1527(11.0%)
4881(35.2%)
6294(45.4%)
1171(8.4%)
13873 (100.0%) 0 (0.0%)
13 perp_sex [factor]
1. F
2. M
401(2.9%)
13472(97.1%)
13873 (100.0%) 0 (0.0%)
14 vic_sex [factor]
1. F
2. M
1597(11.5%)
12276(88.5%)
13873 (100.0%) 0 (0.0%)
15 perp_race [factor]
1. ASIAN/WHITE
2. BLACK
3. BLACK HISPANIC
4. WHITE HISPANIC
409(2.9%)
10078(72.6%)
1211(8.7%)
2175(15.7%)
13873 (100.0%) 0 (0.0%)
16 vic_race [factor]
1. ASIAN/WHITE
2. BLACK
3. BLACK HISPANIC
4. WHITE HISPANIC
686(4.9%)
9363(67.5%)
1448(10.4%)
2376(17.1%)
13873 (100.0%) 0 (0.0%)
17 other_victims [numeric]
Mean (sd) : 1.2 (2.1)
min ≤ med ≤ max:
0 ≤ 0 ≤ 17
IQR (CV) : 1 (1.8)
11 distinct values 13873 (100.0%) 0 (0.0%)
18 jurisdiction [factor]
1. PATROL
2. HOUSING
11671(84.1%)
2202(15.9%)
13873 (100.0%) 0 (0.0%)
19 murder [factor]
1. FALSE
2. TRUE
10561(76.1%)
3312(23.9%)
13873 (100.0%) 0 (0.0%)
20 murder_prob [numeric]
Min : 0
Mean : 0.2
Max : 1
0:10561(76.1%)
1:3312(23.9%)
13873 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

3 Train/Test split

set.seed(2)

train <- sample(nrow(shootings), 0.80*nrow(shootings))
test <- (-train)

shootings.train <- subset(shootings[train,], select = -murder )
shootings.test <- subset(shootings[test,], select = -murder )

dim(shootings.train)
## [1] 11098    19
dim(shootings.test)
## [1] 2775   19

4 Logistic regression

Let’s start with a model with all the predictors.

glm.full <- glm(murder_prob ~ ., data=shootings.train, family=binomial)
summary(glm.full)
## 
## Call:
## glm(formula = murder_prob ~ ., family = binomial, data = shootings.train)
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.662e+02  1.002e+02  -1.658  0.09731 .  
## day_periodMorning            -2.322e-01  1.385e-01  -1.676  0.09371 .  
## day_periodEarlyAfternoon     -3.859e-01  1.292e-01  -2.988  0.00281 ** 
## day_periodAfternoon          -4.484e-01  1.057e-01  -4.243 2.20e-05 ***
## day_periodEvening            -4.227e-01  1.025e-01  -4.126 3.70e-05 ***
## day_periodNight              -4.198e-01  9.579e-02  -4.382 1.18e-05 ***
## day_year                      1.228e-04  2.354e-04   0.521  0.60209    
## year                         -9.052e-03  6.417e-03  -1.411  0.15833    
## week_dayTuesday              -3.710e-02  9.097e-02  -0.408  0.68343    
## week_dayWednesday             9.868e-02  8.966e-02   1.101  0.27105    
## week_dayThursday             -1.424e-02  9.026e-02  -0.158  0.87468    
## week_dayFriday                5.631e-02  8.695e-02   0.648  0.51721    
## week_daySaturday             -1.261e-01  8.511e-02  -1.481  0.13856    
## week_daySunday               -2.656e-02  8.442e-02  -0.315  0.75301    
## COVID_lockdownTRUE           -8.553e-02  1.308e-01  -0.654  0.51307    
## COVID_pandemicTRUE            7.755e-02  8.769e-02   0.884  0.37654    
## working_hourTRUE             -5.431e-02  8.995e-02  -0.604  0.54599    
## Latitude                      1.967e+00  1.037e+00   1.897  0.05783 .  
## Longitude                    -1.398e+00  1.129e+00  -1.238  0.21567    
## city_locationN_Manhattan     -2.087e-01  1.730e-01  -1.207  0.22751    
## city_locationW_Bronx          2.593e-02  1.914e-01   0.135  0.89223    
## city_locationE_Bronx         -2.007e-02  2.379e-01  -0.084  0.93277    
## city_locationN_E_Queens       2.618e-01  2.869e-01   0.912  0.36151    
## city_locationN_W_Queens       8.406e-02  2.001e-01   0.420  0.67442    
## city_locationS_Queens         4.580e-01  2.956e-01   1.550  0.12125    
## city_locationN_Brooklyn       1.269e-01  1.609e-01   0.789  0.43026    
## city_locationS_E_Brooklyn     2.283e-01  1.919e-01   1.190  0.23416    
## city_locationS_W_Brooklyn     2.347e-01  2.041e-01   1.150  0.25015    
## city_locationW_Staten_Island -4.238e-01  3.837e-01  -1.105  0.26934    
## city_locationE_Staten_Island -2.363e-02  2.413e-01  -0.098  0.92199    
## perp_age18-24                 1.351e-01  8.353e-02   1.618  0.10575    
## perp_age25-44                 4.083e-01  8.540e-02   4.781 1.74e-06 ***
## perp_age45+                   6.788e-01  1.270e-01   5.343 9.16e-08 ***
## vic_age18-24                  2.752e-01  8.688e-02   3.168  0.00154 ** 
## vic_age25-44                  3.470e-01  8.680e-02   3.998 6.40e-05 ***
## vic_age45+                    3.084e-01  1.121e-01   2.751  0.00595 ** 
## perp_sexM                    -4.223e-02  1.278e-01  -0.330  0.74113    
## vic_sexM                     -2.028e-02  7.069e-02  -0.287  0.77423    
## perp_raceBLACK               -5.226e-01  1.340e-01  -3.900 9.62e-05 ***
## perp_raceBLACK HISPANIC      -6.411e-01  1.535e-01  -4.175 2.97e-05 ***
## perp_raceWHITE HISPANIC      -4.881e-01  1.413e-01  -3.455  0.00055 ***
## vic_raceBLACK                -7.186e-02  1.113e-01  -0.646  0.51853    
## vic_raceBLACK HISPANIC       -3.365e-01  1.310e-01  -2.570  0.01017 *  
## vic_raceWHITE HISPANIC       -9.342e-02  1.191e-01  -0.784  0.43287    
## other_victims                 1.347e-01  9.960e-03  13.520  < 2e-16 ***
## jurisdictionHOUSING          -1.909e-01  6.724e-02  -2.839  0.00453 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12203  on 11097  degrees of freedom
## Residual deviance: 11796  on 11052  degrees of freedom
## AIC: 11888
## 
## Number of Fisher Scoring iterations: 4

As we can see:

  • the day period is strongly significant for the victim survival.

  • the day of the year, the year, the day of the week, the COVID periods and the working hour are not significant.

  • for some reason Latitude is slightly significant but Longitude is not.

  • the city location is not significant at all.

  • the perpetrator age is significant.

  • the victim age is significant as expected.

  • sex predictors for both perpetrator and victim are not significant.

  • perpetrator race predictors is significant, while victim race is slightly significant.

  • the presence of addiction victims is strongly significant.

  • the jurisdiction is strongly significant.

First check multicollinearity:

vif(glm.full)
##                      GVIF Df GVIF^(1/(2*Df))
## day_period       2.325583  5        1.088061
## day_year         1.012866  1        1.006412
## year             2.214926  1        1.488263
## week_day         1.308980  6        1.022691
## COVID_lockdown   1.205156  1        1.097796
## COVID_pandemic   2.412109  1        1.553097
## working_hour     2.352406  1        1.533756
## Latitude        16.666168  1        4.082422
## Longitude       12.048878  1        3.471149
## city_location  223.957441 11        1.278868
## perp_age         1.263237  3        1.039715
## vic_age          1.271971  3        1.040909
## perp_sex         1.011028  1        1.005499
## vic_sex          1.050139  1        1.024763
## perp_race        1.663579  3        1.088530
## vic_race         1.670686  3        1.089304
## other_victims    1.068994  1        1.033922
## jurisdiction     1.060846  1        1.029974

As we can see city_location predictors has an extremely large VIF and it is also not significant, we should remove it.

glm.full <- update(glm.full, .~.-city_location)
vif(glm.full)
##                    GVIF Df GVIF^(1/(2*Df))
## day_period     2.281159  5        1.085964
## day_year       1.010534  1        1.005253
## year           2.199714  1        1.483143
## week_day       1.294785  6        1.021762
## COVID_lockdown 1.200298  1        1.095581
## COVID_pandemic 2.396762  1        1.548148
## working_hour   2.347309  1        1.532093
## Latitude       1.167251  1        1.080394
## Longitude      1.069447  1        1.034141
## perp_age       1.255338  3        1.038628
## vic_age        1.264817  3        1.039931
## perp_sex       1.008670  1        1.004326
## vic_sex        1.048185  1        1.023809
## perp_race      1.600571  3        1.081548
## vic_race       1.624580  3        1.084235
## other_victims  1.064209  1        1.031605
## jurisdiction   1.029270  1        1.014530

Now we check for influential points:

influenceIndexPlot(glm.full, vars = "C")

The cook’s distance plot does not give indication of influential points.

Now let’s create a model with only significant predictors:

glm.sig <- glm(murder_prob ~ day_period + perp_age + vic_age + perp_race+ vic_race + other_victims + jurisdiction, data=shootings.train, family=binomial)
summary(glm.sig)
## 
## Call:
## glm(formula = murder_prob ~ day_period + perp_age + vic_age + 
##     perp_race + vic_race + other_victims + jurisdiction, family = binomial, 
##     data = shootings.train)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -0.879412   0.183232  -4.799 1.59e-06 ***
## day_periodMorning        -0.248163   0.123962  -2.002 0.045293 *  
## day_periodEarlyAfternoon -0.402931   0.113475  -3.551 0.000384 ***
## day_periodAfternoon      -0.436644   0.101897  -4.285 1.83e-05 ***
## day_periodEvening        -0.390479   0.101079  -3.863 0.000112 ***
## day_periodNight          -0.400965   0.095378  -4.204 2.62e-05 ***
## perp_age18-24             0.132067   0.083237   1.587 0.112592    
## perp_age25-44             0.404154   0.085004   4.755 1.99e-06 ***
## perp_age45+               0.673634   0.126180   5.339 9.36e-08 ***
## vic_age18-24              0.270305   0.086313   3.132 0.001738 ** 
## vic_age25-44              0.330290   0.085939   3.843 0.000121 ***
## vic_age45+                0.296612   0.111439   2.662 0.007776 ** 
## perp_raceBLACK           -0.525160   0.132352  -3.968 7.25e-05 ***
## perp_raceBLACK HISPANIC  -0.616717   0.151154  -4.080 4.50e-05 ***
## perp_raceWHITE HISPANIC  -0.466753   0.139693  -3.341 0.000834 ***
## vic_raceBLACK            -0.067016   0.109650  -0.611 0.541079    
## vic_raceBLACK HISPANIC   -0.311194   0.128657  -2.419 0.015573 *  
## vic_raceWHITE HISPANIC   -0.072053   0.117215  -0.615 0.538751    
## other_victims             0.135840   0.009802  13.859  < 2e-16 ***
## jurisdictionHOUSING      -0.206270   0.065742  -3.138 0.001703 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12203  on 11097  degrees of freedom
## Residual deviance: 11823  on 11078  degrees of freedom
## AIC: 11863
## 
## Number of Fisher Scoring iterations: 4
AIC(glm.sig,glm.full)
##          df      AIC
## glm.sig  20 11863.08
## glm.full 35 11877.18

The AIC indicates that glm.sig is better than the full model. Also the estimate coefficients are more interpretable.

Now we check for influential points:

influenceIndexPlot(glm.sig, vars = "C")

The cook’s distance plot does not give indication of influential points.

4.1 Interaction terms

Now let’s take a look at some interactions between victim and perpetrator. In particular we add to glm.sig the following interactions:

  • perpetrator age and victim age
  • perpetrator race and victim race
glm.sign.inter <- update(glm.sig, . ~ . + perp_age:vic_age + perp_race:vic_race)
summary(glm.sign.inter)
## 
## Call:
## glm(formula = murder_prob ~ day_period + perp_age + vic_age + 
##     perp_race + vic_race + other_victims + jurisdiction + perp_age:vic_age + 
##     perp_race:vic_race, family = binomial, data = shootings.train)
## 
## Coefficients:
##                                                 Estimate Std. Error z value
## (Intercept)                                    -0.613509   0.228981  -2.679
## day_periodMorning                              -0.249730   0.124689  -2.003
## day_periodEarlyAfternoon                       -0.403912   0.114039  -3.542
## day_periodAfternoon                            -0.439770   0.102440  -4.293
## day_periodEvening                              -0.399003   0.101668  -3.925
## day_periodNight                                -0.409595   0.095936  -4.269
## perp_age18-24                                   0.208068   0.178197   1.168
## perp_age25-44                                   0.379735   0.221629   1.713
## perp_age45+                                     0.691423   0.602813   1.147
## vic_age18-24                                    0.197264   0.186343   1.059
## vic_age25-44                                    0.575041   0.197407   2.913
## vic_age45+                                      0.274616   0.353817   0.776
## perp_raceBLACK                                 -1.029365   0.217937  -4.723
## perp_raceBLACK HISPANIC                        -1.624632   0.508655  -3.194
## perp_raceWHITE HISPANIC                        -0.824480   0.270888  -3.044
## vic_raceBLACK                                  -0.984678   0.327303  -3.008
## vic_raceBLACK HISPANIC                         -1.261441   0.524508  -2.405
## vic_raceWHITE HISPANIC                         -0.575297   0.323062  -1.781
## other_victims                                   0.137799   0.009864  13.970
## jurisdictionHOUSING                            -0.212252   0.065955  -3.218
## perp_age18-24:vic_age18-24                      0.044046   0.221089   0.199
## perp_age25-44:vic_age18-24                      0.163322   0.260440   0.627
## perp_age45+:vic_age18-24                        0.137680   0.678234   0.203
## perp_age18-24:vic_age25-44                     -0.273515   0.231128  -1.183
## perp_age25-44:vic_age25-44                     -0.179458   0.263781  -0.680
## perp_age45+:vic_age25-44                       -0.389557   0.632589  -0.616
## perp_age18-24:vic_age45+                       -0.180258   0.398888  -0.452
## perp_age25-44:vic_age45+                        0.083334   0.408223   0.204
## perp_age45+:vic_age45+                          0.237229   0.702978   0.337
## perp_raceBLACK:vic_raceBLACK                    1.122202   0.363885   3.084
## perp_raceBLACK HISPANIC:vic_raceBLACK           1.638309   0.597946   2.740
## perp_raceWHITE HISPANIC:vic_raceBLACK           1.013128   0.408489   2.480
## perp_raceBLACK:vic_raceBLACK HISPANIC           1.202486   0.557728   2.156
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC  1.647694   0.733303   2.247
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC  0.974565   0.587972   1.658
## perp_raceBLACK:vic_raceWHITE HISPANIC           0.713939   0.367815   1.941
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC  1.265044   0.599812   2.109
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC  0.543712   0.402118   1.352
##                                                Pr(>|z|)    
## (Intercept)                                    0.007378 ** 
## day_periodMorning                              0.045196 *  
## day_periodEarlyAfternoon                       0.000397 ***
## day_periodAfternoon                            1.76e-05 ***
## day_periodEvening                              8.69e-05 ***
## day_periodNight                                1.96e-05 ***
## perp_age18-24                                  0.242955    
## perp_age25-44                                  0.086642 .  
## perp_age45+                                    0.251384    
## vic_age18-24                                   0.289778    
## vic_age25-44                                   0.003580 ** 
## vic_age45+                                     0.437660    
## perp_raceBLACK                                 2.32e-06 ***
## perp_raceBLACK HISPANIC                        0.001403 ** 
## perp_raceWHITE HISPANIC                        0.002338 ** 
## vic_raceBLACK                                  0.002626 ** 
## vic_raceBLACK HISPANIC                         0.016173 *  
## vic_raceWHITE HISPANIC                         0.074951 .  
## other_victims                                   < 2e-16 ***
## jurisdictionHOUSING                            0.001290 ** 
## perp_age18-24:vic_age18-24                     0.842090    
## perp_age25-44:vic_age18-24                     0.530595    
## perp_age45+:vic_age18-24                       0.839137    
## perp_age18-24:vic_age25-44                     0.236655    
## perp_age25-44:vic_age25-44                     0.496297    
## perp_age45+:vic_age25-44                       0.538018    
## perp_age18-24:vic_age45+                       0.651340    
## perp_age25-44:vic_age45+                       0.838246    
## perp_age45+:vic_age45+                         0.735768    
## perp_raceBLACK:vic_raceBLACK                   0.002043 ** 
## perp_raceBLACK HISPANIC:vic_raceBLACK          0.006146 ** 
## perp_raceWHITE HISPANIC:vic_raceBLACK          0.013132 *  
## perp_raceBLACK:vic_raceBLACK HISPANIC          0.031080 *  
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.024643 *  
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.097418 .  
## perp_raceBLACK:vic_raceWHITE HISPANIC          0.052255 .  
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.034939 *  
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.176337    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12203  on 11097  degrees of freedom
## Residual deviance: 11799  on 11060  degrees of freedom
## AIC: 11875
## 
## Number of Fisher Scoring iterations: 4

As we can see the interaction between perpetrator race and victims race is slightly significant, while between perpetrator age and victim age is not. Let’s remove it.

glm.sign.inter <- update(glm.sign.inter, . ~ . - perp_age:vic_age)
summary(glm.sign.inter)
## 
## Call:
## glm(formula = murder_prob ~ day_period + perp_age + vic_age + 
##     perp_race + vic_race + other_victims + jurisdiction + perp_race:vic_race, 
##     family = binomial, data = shootings.train)
## 
## Coefficients:
##                                                 Estimate Std. Error z value
## (Intercept)                                    -0.559980   0.202144  -2.770
## day_periodMorning                              -0.258822   0.124399  -2.081
## day_periodEarlyAfternoon                       -0.410170   0.113749  -3.606
## day_periodAfternoon                            -0.444091   0.102223  -4.344
## day_periodEvening                              -0.404341   0.101427  -3.987
## day_periodNight                                -0.412883   0.095703  -4.314
## perp_age18-24                                   0.131995   0.083283   1.585
## perp_age25-44                                   0.406821   0.085074   4.782
## perp_age45+                                     0.659310   0.126704   5.204
## vic_age18-24                                    0.276794   0.086405   3.203
## vic_age25-44                                    0.339827   0.086071   3.948
## vic_age45+                                      0.298527   0.111728   2.672
## perp_raceBLACK                                 -1.052892   0.217318  -4.845
## perp_raceBLACK HISPANIC                        -1.624995   0.508221  -3.197
## perp_raceWHITE HISPANIC                        -0.836254   0.269996  -3.097
## vic_raceBLACK                                  -0.995607   0.326765  -3.047
## vic_raceBLACK HISPANIC                         -1.286057   0.524276  -2.453
## vic_raceWHITE HISPANIC                         -0.578081   0.322142  -1.794
## other_victims                                   0.136781   0.009828  13.918
## jurisdictionHOUSING                            -0.208539   0.065880  -3.165
## perp_raceBLACK:vic_raceBLACK                    1.141862   0.363324   3.143
## perp_raceBLACK HISPANIC:vic_raceBLACK           1.629987   0.597472   2.728
## perp_raceWHITE HISPANIC:vic_raceBLACK           1.020911   0.407687   2.504
## perp_raceBLACK:vic_raceBLACK HISPANIC           1.236925   0.557397   2.219
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC  1.667519   0.732951   2.275
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC  0.996136   0.587541   1.695
## perp_raceBLACK:vic_raceWHITE HISPANIC           0.723782   0.366942   1.972
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC  1.254745   0.599077   2.094
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC  0.544767   0.401143   1.358
##                                                Pr(>|z|)    
## (Intercept)                                    0.005602 ** 
## day_periodMorning                              0.037472 *  
## day_periodEarlyAfternoon                       0.000311 ***
## day_periodAfternoon                            1.40e-05 ***
## day_periodEvening                              6.71e-05 ***
## day_periodNight                                1.60e-05 ***
## perp_age18-24                                  0.112991    
## perp_age25-44                                  1.74e-06 ***
## perp_age45+                                    1.95e-07 ***
## vic_age18-24                                   0.001358 ** 
## vic_age25-44                                   7.87e-05 ***
## vic_age45+                                     0.007542 ** 
## perp_raceBLACK                                 1.27e-06 ***
## perp_raceBLACK HISPANIC                        0.001387 ** 
## perp_raceWHITE HISPANIC                        0.001953 ** 
## vic_raceBLACK                                  0.002312 ** 
## vic_raceBLACK HISPANIC                         0.014167 *  
## vic_raceWHITE HISPANIC                         0.072735 .  
## other_victims                                   < 2e-16 ***
## jurisdictionHOUSING                            0.001548 ** 
## perp_raceBLACK:vic_raceBLACK                   0.001673 ** 
## perp_raceBLACK HISPANIC:vic_raceBLACK          0.006369 ** 
## perp_raceWHITE HISPANIC:vic_raceBLACK          0.012274 *  
## perp_raceBLACK:vic_raceBLACK HISPANIC          0.026479 *  
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.022901 *  
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.089993 .  
## perp_raceBLACK:vic_raceWHITE HISPANIC          0.048556 *  
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.036219 *  
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.174452    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12203  on 11097  degrees of freedom
## Residual deviance: 11806  on 11069  degrees of freedom
## AIC: 11864
## 
## Number of Fisher Scoring iterations: 4
AIC(glm.sign.inter, glm.sig)
##                df      AIC
## glm.sign.inter 29 11864.46
## glm.sig        20 11863.08

Since the coefficient of the interaction are significant and the AIC increase is negligible (only one unit) we prefer the model with interactions.

Now let’s try adding interaction terms to the full model. Now let’s add the following interactions:

  • perpetrator age and victim age
  • perpetrator race and victim race
  • perpetrator sex and victim sex
  • year and day of the year
glm.full.inter <- update(glm.full, . ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year)
summary(glm.full.inter)
## 
## Call:
## glm(formula = murder_prob ~ day_period + day_year + year + week_day + 
##     COVID_lockdown + COVID_pandemic + working_hour + Latitude + 
##     Longitude + perp_age + vic_age + perp_sex + vic_sex + perp_race + 
##     vic_race + other_victims + jurisdiction + perp_race:vic_race + 
##     perp_age:vic_age + perp_sex:vic_sex + day_year:year, family = binomial, 
##     data = shootings.train)
## 
## Coefficients:
##                                                  Estimate Std. Error z value
## (Intercept)                                    -2.107e+00  3.625e+01  -0.058
## day_periodMorning                              -2.345e-01  1.391e-01  -1.686
## day_periodEarlyAfternoon                       -3.812e-01  1.297e-01  -2.938
## day_periodAfternoon                            -4.435e-01  1.061e-01  -4.182
## day_periodEvening                              -4.218e-01  1.029e-01  -4.100
## day_periodNight                                -4.167e-01  9.624e-02  -4.329
## day_year                                        1.139e-01  9.206e-02   1.237
## year                                            2.336e-03  1.065e-02   0.219
## week_dayTuesday                                -4.600e-02  9.113e-02  -0.505
## week_dayWednesday                               1.046e-01  8.979e-02   1.164
## week_dayThursday                               -1.110e-02  9.044e-02  -0.123
## week_dayFriday                                  6.615e-02  8.704e-02   0.760
## week_daySaturday                               -1.219e-01  8.518e-02  -1.431
## week_daySunday                                 -2.592e-02  8.444e-02  -0.307
## COVID_lockdownTRUE                             -7.062e-02  1.307e-01  -0.540
## COVID_pandemicTRUE                              6.581e-02  8.763e-02   0.751
## working_hourTRUE                               -5.089e-02  9.003e-02  -0.565
## Latitude                                        4.827e-01  2.763e-01   1.747
## Longitude                                       3.145e-01  3.360e-01   0.936
## perp_age18-24                                   2.100e-01  1.784e-01   1.177
## perp_age25-44                                   3.740e-01  2.220e-01   1.684
## perp_age45+                                     6.880e-01  6.042e-01   1.139
## vic_age18-24                                    2.052e-01  1.868e-01   1.098
## vic_age25-44                                    5.792e-01  1.979e-01   2.927
## vic_age45+                                      2.470e-01  3.546e-01   0.696
## perp_sexM                                       3.946e-01  3.376e-01   1.169
## vic_sexM                                        4.819e-01  3.578e-01   1.347
## perp_raceBLACK                                 -1.047e+00  2.195e-01  -4.770
## perp_raceBLACK HISPANIC                        -1.662e+00  5.100e-01  -3.258
## perp_raceWHITE HISPANIC                        -8.420e-01  2.721e-01  -3.094
## vic_raceBLACK                                  -1.003e+00  3.287e-01  -3.050
## vic_raceBLACK HISPANIC                         -1.278e+00  5.253e-01  -2.432
## vic_raceWHITE HISPANIC                         -6.147e-01  3.252e-01  -1.890
## other_victims                                   1.369e-01  1.002e-02  13.653
## jurisdictionHOUSING                            -2.050e-01  6.641e-02  -3.086
## perp_raceBLACK:vic_raceBLACK                    1.145e+00  3.655e-01   3.134
## perp_raceBLACK HISPANIC:vic_raceBLACK           1.652e+00  5.993e-01   2.757
## perp_raceWHITE HISPANIC:vic_raceBLACK           1.010e+00  4.097e-01   2.464
## perp_raceBLACK:vic_raceBLACK HISPANIC           1.204e+00  5.587e-01   2.156
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC  1.645e+00  7.345e-01   2.239
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC  9.734e-01  5.888e-01   1.653
## perp_raceBLACK:vic_raceWHITE HISPANIC           7.349e-01  3.692e-01   1.990
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC  1.289e+00  6.014e-01   2.143
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC  5.662e-01  4.037e-01   1.402
## perp_age18-24:vic_age18-24                      4.108e-02  2.214e-01   0.186
## perp_age25-44:vic_age18-24                      1.672e-01  2.609e-01   0.641
## perp_age45+:vic_age18-24                        1.612e-01  6.800e-01   0.237
## perp_age18-24:vic_age25-44                     -2.667e-01  2.315e-01  -1.152
## perp_age25-44:vic_age25-44                     -1.653e-01  2.644e-01  -0.625
## perp_age45+:vic_age25-44                       -3.765e-01  6.340e-01  -0.594
## perp_age18-24:vic_age45+                       -1.657e-01  3.998e-01  -0.414
## perp_age25-44:vic_age45+                        1.290e-01  4.095e-01   0.315
## perp_age45+:vic_age45+                          2.889e-01  7.046e-01   0.410
## perp_sexM:vic_sexM                             -5.145e-01  3.650e-01  -1.409
## day_year:year                                  -5.650e-05  4.572e-05  -1.236
##                                                Pr(>|z|)    
## (Intercept)                                     0.95366    
## day_periodMorning                               0.09185 .  
## day_periodEarlyAfternoon                        0.00330 ** 
## day_periodAfternoon                            2.89e-05 ***
## day_periodEvening                              4.13e-05 ***
## day_periodNight                                1.49e-05 ***
## day_year                                        0.21606    
## year                                            0.82631    
## week_dayTuesday                                 0.61367    
## week_dayWednesday                               0.24422    
## week_dayThursday                                0.90233    
## week_dayFriday                                  0.44725    
## week_daySaturday                                0.15239    
## week_daySunday                                  0.75889    
## COVID_lockdownTRUE                              0.58899    
## COVID_pandemicTRUE                              0.45261    
## working_hourTRUE                                0.57192    
## Latitude                                        0.08059 .  
## Longitude                                       0.34931    
## perp_age18-24                                   0.23935    
## perp_age25-44                                   0.09210 .  
## perp_age45+                                     0.25478    
## vic_age18-24                                    0.27214    
## vic_age25-44                                    0.00342 ** 
## vic_age45+                                      0.48612    
## perp_sexM                                       0.24257    
## vic_sexM                                        0.17809    
## perp_raceBLACK                                 1.84e-06 ***
## perp_raceBLACK HISPANIC                         0.00112 ** 
## perp_raceWHITE HISPANIC                         0.00197 ** 
## vic_raceBLACK                                   0.00229 ** 
## vic_raceBLACK HISPANIC                          0.01501 *  
## vic_raceWHITE HISPANIC                          0.05873 .  
## other_victims                                   < 2e-16 ***
## jurisdictionHOUSING                             0.00203 ** 
## perp_raceBLACK:vic_raceBLACK                    0.00172 ** 
## perp_raceBLACK HISPANIC:vic_raceBLACK           0.00583 ** 
## perp_raceWHITE HISPANIC:vic_raceBLACK           0.01372 *  
## perp_raceBLACK:vic_raceBLACK HISPANIC           0.03110 *  
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC  0.02513 *  
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC  0.09830 .  
## perp_raceBLACK:vic_raceWHITE HISPANIC           0.04656 *  
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC  0.03213 *  
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC  0.16077    
## perp_age18-24:vic_age18-24                      0.85279    
## perp_age25-44:vic_age18-24                      0.52172    
## perp_age45+:vic_age18-24                        0.81259    
## perp_age18-24:vic_age25-44                      0.24922    
## perp_age25-44:vic_age25-44                      0.53197    
## perp_age45+:vic_age25-44                        0.55264    
## perp_age18-24:vic_age45+                        0.67860    
## perp_age25-44:vic_age45+                        0.75265    
## perp_age45+:vic_age45+                          0.68180    
## perp_sexM:vic_sexM                              0.15874    
## day_year:year                                   0.21657    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12203  on 11097  degrees of freedom
## Residual deviance: 11779  on 11043  degrees of freedom
## AIC: 11889
## 
## Number of Fisher Scoring iterations: 4

As before the interaction between perpetrator race and victim race is significant, the interaction between perpetrator age and victim age is not significant. Furthermore the interaction between perpetrator sex and victim sex and the interaction between year and day of the year are not significant. Let’s remove the not significant interactions:

glm.full.inter <- update(glm.full.inter, . ~ . - perp_age:vic_age - perp_sex:vic_sex - year:day_year)
summary(glm.full.inter)
## 
## Call:
## glm(formula = murder_prob ~ day_period + day_year + year + week_day + 
##     COVID_lockdown + COVID_pandemic + working_hour + Latitude + 
##     Longitude + perp_age + vic_age + perp_sex + vic_sex + perp_race + 
##     vic_race + other_victims + jurisdiction + perp_race:vic_race, 
##     family = binomial, data = shootings.train)
## 
## Coefficients:
##                                                  Estimate Std. Error z value
## (Intercept)                                    20.7959594 31.8958533   0.652
## day_periodMorning                              -0.2456957  0.1387187  -1.771
## day_periodEarlyAfternoon                       -0.3911497  0.1293100  -3.025
## day_periodAfternoon                            -0.4521892  0.1057480  -4.276
## day_periodEvening                              -0.4292800  0.1025737  -4.185
## day_periodNight                                -0.4223721  0.0959247  -4.403
## day_year                                        0.0001051  0.0002355   0.446
## year                                           -0.0085092  0.0064051  -1.329
## week_dayTuesday                                -0.0441346  0.0909882  -0.485
## week_dayWednesday                               0.0981691  0.0896198   1.095
## week_dayThursday                               -0.0124163  0.0902670  -0.138
## week_dayFriday                                  0.0574862  0.0868911   0.662
## week_daySaturday                               -0.1250365  0.0850505  -1.470
## week_daySunday                                 -0.0287566  0.0843183  -0.341
## COVID_lockdownTRUE                             -0.0798501  0.1305400  -0.612
## COVID_pandemicTRUE                              0.0681454  0.0874755   0.779
## working_hourTRUE                               -0.0498608  0.0898993  -0.555
## Latitude                                        0.4716384  0.2759198   1.709
## Longitude                                       0.3163084  0.3355788   0.943
## perp_age18-24                                   0.1349025  0.0834756   1.616
## perp_age25-44                                   0.4110874  0.0853446   4.817
## perp_age45+                                     0.6673513  0.1272876   5.243
## vic_age18-24                                    0.2796371  0.0868504   3.220
## vic_age25-44                                    0.3500667  0.0867783   4.034
## vic_age45+                                      0.3038834  0.1121759   2.709
## perp_sexM                                      -0.0375984  0.1280062  -0.294
## vic_sexM                                       -0.0147228  0.0707723  -0.208
## perp_raceBLACK                                 -1.0615720  0.2187197  -4.854
## perp_raceBLACK HISPANIC                        -1.6440918  0.5094168  -3.227
## perp_raceWHITE HISPANIC                        -0.8465790  0.2711553  -3.122
## vic_raceBLACK                                  -1.0061478  0.3279735  -3.068
## vic_raceBLACK HISPANIC                         -1.2903632  0.5251927  -2.457
## vic_raceWHITE HISPANIC                         -0.5948548  0.3235285  -1.839
## other_victims                                   0.1368425  0.0099637  13.734
## jurisdictionHOUSING                            -0.2017840  0.0663307  -3.042
## perp_raceBLACK:vic_raceBLACK                    1.1495145  0.3646584   3.152
## perp_raceBLACK HISPANIC:vic_raceBLACK           1.6223876  0.5986164   2.710
## perp_raceWHITE HISPANIC:vic_raceBLACK           1.0030521  0.4086859   2.454
## perp_raceBLACK:vic_raceBLACK HISPANIC           1.2198175  0.5584145   2.184
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC  1.6372517  0.7340633   2.230
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC  0.9769613  0.5884792   1.660
## perp_raceBLACK:vic_raceWHITE HISPANIC           0.7157150  0.3676400   1.947
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC  1.2428665  0.6000116   2.071
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC  0.5385956  0.4020950   1.339
##                                                Pr(>|z|)    
## (Intercept)                                     0.51440    
## day_periodMorning                               0.07653 .  
## day_periodEarlyAfternoon                        0.00249 ** 
## day_periodAfternoon                            1.90e-05 ***
## day_periodEvening                              2.85e-05 ***
## day_periodNight                                1.07e-05 ***
## day_year                                        0.65546    
## year                                            0.18401    
## week_dayTuesday                                 0.62763    
## week_dayWednesday                               0.27334    
## week_dayThursday                                0.89060    
## week_dayFriday                                  0.50823    
## week_daySaturday                                0.14152    
## week_daySunday                                  0.73307    
## COVID_lockdownTRUE                              0.54074    
## COVID_pandemicTRUE                              0.43597    
## working_hourTRUE                                0.57915    
## Latitude                                        0.08739 .  
## Longitude                                       0.34590    
## perp_age18-24                                   0.10608    
## perp_age25-44                                  1.46e-06 ***
## perp_age45+                                    1.58e-07 ***
## vic_age18-24                                    0.00128 ** 
## vic_age25-44                                   5.48e-05 ***
## vic_age45+                                      0.00675 ** 
## perp_sexM                                       0.76897    
## vic_sexM                                        0.83521    
## perp_raceBLACK                                 1.21e-06 ***
## perp_raceBLACK HISPANIC                         0.00125 ** 
## perp_raceWHITE HISPANIC                         0.00180 ** 
## vic_raceBLACK                                   0.00216 ** 
## vic_raceBLACK HISPANIC                          0.01401 *  
## vic_raceWHITE HISPANIC                          0.06597 .  
## other_victims                                   < 2e-16 ***
## jurisdictionHOUSING                             0.00235 ** 
## perp_raceBLACK:vic_raceBLACK                    0.00162 ** 
## perp_raceBLACK HISPANIC:vic_raceBLACK           0.00672 ** 
## perp_raceWHITE HISPANIC:vic_raceBLACK           0.01411 *  
## perp_raceBLACK:vic_raceBLACK HISPANIC           0.02893 *  
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC  0.02572 *  
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC  0.09689 .  
## perp_raceBLACK:vic_raceWHITE HISPANIC           0.05156 .  
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC  0.03832 *  
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC  0.18042    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12203  on 11097  degrees of freedom
## Residual deviance: 11791  on 11054  degrees of freedom
## AIC: 11879
## 
## Number of Fisher Scoring iterations: 4

Check AIC:

AIC(glm.full)
## [1] 11877.18
AIC(glm.full.inter)
## [1] 11878.91

As before the AIC increase is negligible and the interaction between perpetrator and victim races is significant, thus we prefer this model compared to the full model.

In summary the models with interaction are preferable. Now let’s compare the two models with interaction:

AIC(glm.full.inter)
## [1] 11878.91
AIC(glm.sign.inter)
## [1] 11864.46

In terms of AIC we prefer glm.sign.inter: the model with only significant predictors and the interaction term.

Let’s interpret what the best model in terms of AIC tell us.

summary(glm.sign.inter)
## 
## Call:
## glm(formula = murder_prob ~ day_period + perp_age + vic_age + 
##     perp_race + vic_race + other_victims + jurisdiction + perp_race:vic_race, 
##     family = binomial, data = shootings.train)
## 
## Coefficients:
##                                                 Estimate Std. Error z value
## (Intercept)                                    -0.559980   0.202144  -2.770
## day_periodMorning                              -0.258822   0.124399  -2.081
## day_periodEarlyAfternoon                       -0.410170   0.113749  -3.606
## day_periodAfternoon                            -0.444091   0.102223  -4.344
## day_periodEvening                              -0.404341   0.101427  -3.987
## day_periodNight                                -0.412883   0.095703  -4.314
## perp_age18-24                                   0.131995   0.083283   1.585
## perp_age25-44                                   0.406821   0.085074   4.782
## perp_age45+                                     0.659310   0.126704   5.204
## vic_age18-24                                    0.276794   0.086405   3.203
## vic_age25-44                                    0.339827   0.086071   3.948
## vic_age45+                                      0.298527   0.111728   2.672
## perp_raceBLACK                                 -1.052892   0.217318  -4.845
## perp_raceBLACK HISPANIC                        -1.624995   0.508221  -3.197
## perp_raceWHITE HISPANIC                        -0.836254   0.269996  -3.097
## vic_raceBLACK                                  -0.995607   0.326765  -3.047
## vic_raceBLACK HISPANIC                         -1.286057   0.524276  -2.453
## vic_raceWHITE HISPANIC                         -0.578081   0.322142  -1.794
## other_victims                                   0.136781   0.009828  13.918
## jurisdictionHOUSING                            -0.208539   0.065880  -3.165
## perp_raceBLACK:vic_raceBLACK                    1.141862   0.363324   3.143
## perp_raceBLACK HISPANIC:vic_raceBLACK           1.629987   0.597472   2.728
## perp_raceWHITE HISPANIC:vic_raceBLACK           1.020911   0.407687   2.504
## perp_raceBLACK:vic_raceBLACK HISPANIC           1.236925   0.557397   2.219
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC  1.667519   0.732951   2.275
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC  0.996136   0.587541   1.695
## perp_raceBLACK:vic_raceWHITE HISPANIC           0.723782   0.366942   1.972
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC  1.254745   0.599077   2.094
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC  0.544767   0.401143   1.358
##                                                Pr(>|z|)    
## (Intercept)                                    0.005602 ** 
## day_periodMorning                              0.037472 *  
## day_periodEarlyAfternoon                       0.000311 ***
## day_periodAfternoon                            1.40e-05 ***
## day_periodEvening                              6.71e-05 ***
## day_periodNight                                1.60e-05 ***
## perp_age18-24                                  0.112991    
## perp_age25-44                                  1.74e-06 ***
## perp_age45+                                    1.95e-07 ***
## vic_age18-24                                   0.001358 ** 
## vic_age25-44                                   7.87e-05 ***
## vic_age45+                                     0.007542 ** 
## perp_raceBLACK                                 1.27e-06 ***
## perp_raceBLACK HISPANIC                        0.001387 ** 
## perp_raceWHITE HISPANIC                        0.001953 ** 
## vic_raceBLACK                                  0.002312 ** 
## vic_raceBLACK HISPANIC                         0.014167 *  
## vic_raceWHITE HISPANIC                         0.072735 .  
## other_victims                                   < 2e-16 ***
## jurisdictionHOUSING                            0.001548 ** 
## perp_raceBLACK:vic_raceBLACK                   0.001673 ** 
## perp_raceBLACK HISPANIC:vic_raceBLACK          0.006369 ** 
## perp_raceWHITE HISPANIC:vic_raceBLACK          0.012274 *  
## perp_raceBLACK:vic_raceBLACK HISPANIC          0.026479 *  
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.022901 *  
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.089993 .  
## perp_raceBLACK:vic_raceWHITE HISPANIC          0.048556 *  
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.036219 *  
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.174452    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12203  on 11097  degrees of freedom
## Residual deviance: 11806  on 11069  degrees of freedom
## AIC: 11864
## 
## Number of Fisher Scoring iterations: 4

Let’s plot the effects for all the interactionless variables:

a <- plot( effect("day_period", glm.sign.inter),rescale.axis=FALSE, ylab="Probability of murder")
b <- plot( effect("perp_age", glm.sign.inter), rescale.axis=FALSE, ylab="Probability of murder")
c <-plot( effect("vic_age", glm.sign.inter), rescale.axis=FALSE, ylab="Probability of murder")
d <-plot( effect("other_victims", glm.sign.inter), rescale.axis=FALSE, ylab="Probability of murder")
e <-plot( effect("jurisdiction", glm.sign.inter), rescale.axis=FALSE, ylab="Probability of murder")

grid.arrange(a,b,c,d,e , top=textGrob("Interactionless variables effect plots",gp=gpar(fontsize=20,font=3)))

As we can see:

  • during the early hours of the day it is less likely for the victim to survive a shooting incident.

  • the more the perpetrator is old, the less likely is for the victim to survive a shooting incident.

  • the more the victim is old, the less likely is for him/her to survive a shooting incident.

  • as the number of other victims increases, the less likely is for the victim to survive a shooting incident.

  • shootings incident with patrol jurisdiction have a higher probability of survival.

plot( effect("perp_race:vic_race", glm.sign.inter), rescale.axis=FALSE, ylab="Probability of murder")

As we can see if the victim is “Black Hispanic”, “White Hispanic” or “Black” the probability of survival doesn’t change much; while if the victim is “ASIAN” or “WHITE”, the probability of survival decreases if the shooter is also “ASIAN” or “WHITE.

4.2 Model comparison

Let’s compare all the models done so far in terms of prediction power to be able to compare them with other types of models:

glm.full.pred <- predict(glm.full, newdata = shootings.test, type = "response")

glm.sig.pred <- predict(glm.sig, newdata = shootings.test, type = "response")

glm.full.inter.pred <- predict(glm.full.inter, newdata = shootings.test, type = "response")

glm.sig.inter.pred <- predict(glm.sign.inter, newdata = shootings.test, type = "response")

Select the best threshold via ROC curve:

par(mfrow=c(2,2))

glm.full.roc <- roc(shootings.test$murder_prob ~ glm.full.pred, plot=TRUE, print.auc=TRUE, main="glm.full ROC curve")
glm.sig.roc <- roc(shootings.test$murder_prob ~ glm.sig.pred, plot=TRUE, print.auc=TRUE, main="glm.sig ROC curve")

glm.full.inter.roc <- roc(shootings.test$murder_prob ~ glm.full.inter.pred, plot=TRUE, print.auc=TRUE, main="glm.full.inter ROC curve")
glm.sig.inter.roc <- roc(shootings.test$murder_prob ~ glm.sig.inter.pred, plot=TRUE, print.auc=TRUE, main="glm.sign.inter ROC curve")

As we can see, the ROC curve is essentially the same for all the models.

glm.full.metrics <- coords(glm.full.roc, x="best", ret="all")
glm.sig.metrics <- coords(glm.sig.roc, x="best", ret="all")
glm.full.inter.metrics <- coords(glm.full.inter.roc, x="best", ret="all")
glm.sig.inter.metrics <- coords(glm.sig.inter.roc, x="best", ret="all")
row.names(glm.full.metrics) <- "Full model"
row.names(glm.sig.metrics) <- "Significant predictors model"
row.names(glm.full.inter.metrics) <- "Full model with interaction"
row.names(glm.sig.inter.metrics) <- "Significant predictors model with interaction"

metrics <- rbind(glm.full.metrics, glm.sig.metrics, glm.full.inter.metrics, glm.sig.inter.metrics)

Now let’s compare:

  1. specificity
(metrics %>% arrange(desc(specificity)))[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036
## Full model with interaction                     0.5397351   0.6323752 0.5618018
## Full model                                      0.5288553   0.6414523 0.5556757
## Significant predictors model                    0.5056764   0.6611195 0.5427027

The best model in terms of specificity is “Significant predictors model with interaction”.

  1. sensitivity
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## Significant predictors model                    0.5056764   0.6611195 0.5427027
## Full model                                      0.5288553   0.6414523 0.5556757
## Full model with interaction                     0.5397351   0.6323752 0.5618018
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036

The best model in terms of sensitivity is “Significant predictors model”.

  1. accuracy
(metrics %>% arrange(desc(accuracy)))[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036
## Full model with interaction                     0.5397351   0.6323752 0.5618018
## Full model                                      0.5288553   0.6414523 0.5556757
## Significant predictors model                    0.5056764   0.6611195 0.5427027

The best model in terms of accuracy is “Significant predictors model with interaction”.

Since the response in unbalanced:

print(dfSummary(shootings.test[,"murder_prob"]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 murder_prob [numeric]
Min : 0
Mean : 0.2
Max : 1
0:2114(76.2%)
1:661(23.8%)
2775 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

The best model in terms of prediction power should maximize its ability to correctly classify murders. For this reason i should choose the model with higher sensitivity and acceptable specificity (>0.5): “Full model”.

4.2.0.1 Confusion Matrixes

confusion_matrix <- function(pred_prob, threshold){
  pred_class <- pred_prob > threshold
  table(preds=pred_class, true=as.logical(shootings.test$murder_prob))
}
  1. Full model
glm.full.metrics[, c("specificity", "sensitivity", "accuracy")]
##            specificity sensitivity  accuracy
## Full model   0.5288553   0.6414523 0.5556757
confusion_matrix(glm.full.pred, glm.full.metrics$threshold)
##        true
## preds   FALSE TRUE
##   FALSE  1118  237
##   TRUE    996  424
  1. Full model with interaction
glm.full.inter.metrics[, c("specificity", "sensitivity", "accuracy")]
##                             specificity sensitivity  accuracy
## Full model with interaction   0.5397351   0.6323752 0.5618018
confusion_matrix(glm.full.inter.pred, glm.full.inter.metrics$threshold)
##        true
## preds   FALSE TRUE
##   FALSE  1141  243
##   TRUE    973  418
  1. Significant predictors model
glm.sig.metrics[, c("specificity", "sensitivity", "accuracy")]
##                              specificity sensitivity  accuracy
## Significant predictors model   0.5056764   0.6611195 0.5427027
confusion_matrix(glm.sig.pred, glm.sig.metrics$threshold)
##        true
## preds   FALSE TRUE
##   FALSE  1069  224
##   TRUE   1045  437
  1. Significant predictors model with interaction
glm.sig.inter.metrics[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036
confusion_matrix(glm.sig.inter.pred, glm.sig.inter.metrics$threshold)
##        true
## preds   FALSE TRUE
##   FALSE  1146  243
##   TRUE    968  418

4.3 Model selection

Now let’s apply automatic model selection to identity the best model.

4.3.1 Stepwiese logistic regression

We use the model which has the most predictors and interaction terms as starting point.

model <- glm(murder_prob ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year, family = binomial, data = shootings.train)
glm.step <- step(model, direction = "both", trace = FALSE)
summary(glm.step)
## 
## Call:
## glm(formula = murder_prob ~ day_period + year + Latitude + perp_age + 
##     vic_age + perp_race + vic_race + other_victims + jurisdiction, 
##     family = binomial, data = shootings.train)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.787131  13.616581  -0.719 0.472285    
## day_periodMorning        -0.246401   0.123987  -1.987 0.046887 *  
## day_periodEarlyAfternoon -0.397900   0.113502  -3.506 0.000455 ***
## day_periodAfternoon      -0.436626   0.101885  -4.285 1.82e-05 ***
## day_periodEvening        -0.394104   0.101080  -3.899 9.66e-05 ***
## day_periodNight          -0.408608   0.095415  -4.282 1.85e-05 ***
## year                     -0.006386   0.004422  -1.444 0.148655    
## Latitude                  0.534628   0.267956   1.995 0.046020 *  
## perp_age18-24             0.132106   0.083252   1.587 0.112554    
## perp_age25-44             0.410781   0.085078   4.828 1.38e-06 ***
## perp_age45+               0.685533   0.126370   5.425 5.80e-08 ***
## vic_age18-24              0.274380   0.086357   3.177 0.001487 ** 
## vic_age25-44              0.339776   0.086189   3.942 8.07e-05 ***
## vic_age45+                0.307450   0.111640   2.754 0.005888 ** 
## perp_raceBLACK           -0.531496   0.132681  -4.006 6.18e-05 ***
## perp_raceBLACK HISPANIC  -0.640503   0.152269  -4.206 2.59e-05 ***
## perp_raceWHITE HISPANIC  -0.487890   0.140572  -3.471 0.000519 ***
## vic_raceBLACK            -0.074958   0.109876  -0.682 0.495111    
## vic_raceBLACK HISPANIC   -0.339954   0.129735  -2.620 0.008783 ** 
## vic_raceWHITE HISPANIC   -0.096896   0.118128  -0.820 0.412069    
## other_victims             0.134627   0.009818  13.712  < 2e-16 ***
## jurisdictionHOUSING      -0.205217   0.065771  -3.120 0.001808 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12203  on 11097  degrees of freedom
## Residual deviance: 11817  on 11076  degrees of freedom
## AIC: 11861
## 
## Number of Fisher Scoring iterations: 4
AIC(glm.full, glm.sig, glm.full.inter, glm.sign.inter, glm.step) %>% arrange(AIC)
##                df      AIC
## glm.step       22 11861.36
## glm.sig        20 11863.08
## glm.sign.inter 29 11864.46
## glm.full       35 11877.18
## glm.full.inter 44 11878.91

In terms of AIC the “Step model” is slightly better.

4.3.1.1 Model comparison

Let’s compare the step model with the models done so far.

glm.step.pred <- predict(glm.step, newdata = shootings.test, type = "response")
glm.step.roc <- roc(shootings.test$murder_prob ~ glm.step.pred, plot=TRUE, print.auc=TRUE, main="Step model ROC curve")

glm.step.metrics <- coords(glm.step.roc, x="best", ret="all")
rownames(glm.step.metrics) <- "Step model"
metrics <- rbind(metrics, glm.step.metrics)
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## Step model                                      0.5132450   0.6641452 0.5491892
## Significant predictors model                    0.5056764   0.6611195 0.5427027
## Full model                                      0.5288553   0.6414523 0.5556757
## Full model with interaction                     0.5397351   0.6323752 0.5618018
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036

The “Step model” is slightly better in terms of sensitivity compared to the best overall model so far (“Full model”) and has an acceptable specificity: it is the new best overall model.

Confusion matrix for step model:

confusion_matrix(glm.step.pred, glm.step.metrics$threshold)
##        true
## preds   FALSE TRUE
##   FALSE  1085  222
##   TRUE   1029  439

4.3.2 Lasso Regression

Now we try with lasso regression. As before we use as starting point the formula with all predictors and interactions term.

lasso.mod <- glmnet(murder_prob ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year, data=shootings.train, alpha=1, family = "binomial")
plot(lasso.mod, main='Path plot of the Lasso estimates\n\n')

We choose lambda using cross validation:

lasso.cv <- cv.glmnet(murder_prob ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year, data=shootings.train, alpha=1, family = "binomial")
plot(lasso.cv)

Lasso gives us the choice between two values of lambda:

  • lambda.min: lambda of minimum mean cross-validated error.

  • lambda.1se: largest value of lambda such that error is within 1 standard error of the cross-validated errors for lambda.min.

We explore both paths.

4.3.2.1 1se lambda

plot(lasso.mod, xvar = "lambda", main='Path plot of the Lasso estimates with 1se lambda\n\n')
abline(v = log(lasso.cv$lambda.1se), lty="dashed")

The remaining coefficients are:

lasso.1se.coef <- coef(lasso.mod,  s=lasso.cv$lambda.1se)

lasso.1se.coef[lasso.1se.coef[,1]!=0,]
##                              (Intercept) 
##                              -1.33784349 
##                   day_periodEarlyMorning 
##                               0.07119012 
##                            perp_age25-44 
##                               0.10728935 
##                              perp_age45+ 
##                               0.16398864 
##                               vic_age<18 
##                              -0.02929429 
##                            other_victims 
##                               0.09101163 
## perp_raceASIAN/WHITE:vic_raceASIAN/WHITE 
##                               0.43398945

4.3.2.2 min lambda

plot(lasso.mod, xvar = "lambda", main='Path plot of the Lasso estimates with min lambda\n\n')
abline(v =log(lasso.cv$lambda.min), lty="dashed")

The remaining coefficients are:

lasso.min.coef <- coef(lasso.mod,  s=lasso.cv$lambda.min)
lasso.min.coef[lasso.min.coef[,1]!=0,]
##                                 (Intercept) 
##                               -7.425535e+00 
##                      day_periodEarlyMorning 
##                                3.345140e-01 
##                           day_periodMorning 
##                                5.401413e-02 
##                         day_periodAfternoon 
##                               -1.020057e-03 
##                           week_dayWednesday 
##                                3.365738e-02 
##                              week_dayFriday 
##                                1.617189e-04 
##                            week_daySaturday 
##                               -4.362882e-02 
##                                    Latitude 
##                                1.437363e-01 
##                    city_locationN_Manhattan 
##                               -7.263793e-02 
##                        city_locationW_Bronx 
##                                3.824675e-02 
##                city_locationW_Staten_Island 
##                               -1.489629e-01 
##                                 perp_age<18 
##                               -4.102680e-02 
##                               perp_age25-44 
##                                2.432676e-01 
##                                 perp_age45+ 
##                                3.707116e-01 
##                                  vic_age<18 
##                               -2.256458e-01 
##                                vic_age25-44 
##                                3.835733e-02 
##                     perp_raceBLACK HISPANIC 
##                               -3.554706e-02 
##                      vic_raceBLACK HISPANIC 
##                               -1.615043e-01 
##                               other_victims 
##                                1.259900e-01 
##                          jurisdictionPATROL 
##                                1.382466e-01 
##                         jurisdictionHOUSING 
##                               -2.125561e-13 
##    perp_raceASIAN/WHITE:vic_raceASIAN/WHITE 
##                                8.056032e-01 
## perp_raceBLACK HISPANIC:vic_raceASIAN/WHITE 
##                               -1.512426e-01 
## perp_raceASIAN/WHITE:vic_raceWHITE HISPANIC 
##                                9.485748e-02 
##                      perp_age<18:vic_age<18 
##                               -4.363924e-02 
##                    perp_age<18:vic_age18-24 
##                               -1.007613e-01 
##                    perp_age18-24:vic_age45+ 
##                               -1.783076e-03 
##                      perp_age45+:vic_age45+ 
##                                2.558898e-01

4.3.2.3 Model comparison

Now let’s compare prediction performance of lasso models.

lasso.1se.pred <- predict(lasso.mod, s=lasso.cv$lambda.1se, newdata = shootings.test, type = "response")
lasso.min.pred <- predict(lasso.mod, s=lasso.cv$lambda.min, newdata = shootings.test, type = "response")
par(mfrow=c(1,2))

lasso.1se.roc <- roc(shootings.test$murder_prob ~ lasso.1se.pred, plot=TRUE, print.auc=TRUE, main="Lasso 1se lambda ROC curve")
lasso.min.roc <- roc(shootings.test$murder_prob ~ lasso.min.pred, plot=TRUE, print.auc=TRUE, main="Lasso min lambda ROC curve")

According to the ROC curve the Lasso model with min lambda is slightly better than the one with 1se lambda.

lasso.1se.metrics <- coords(lasso.1se.roc, x="best", ret="all")
row.names(lasso.1se.metrics) <- "Lasso 1se lambda"

lasso.min.metrics <- coords(lasso.min.roc, x="best", ret="all")
row.names(lasso.min.metrics) <- "Lasso min lambda"
metrics <- rbind(metrics, lasso.1se.metrics, lasso.min.metrics)
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## Lasso 1se lambda                                0.4550615   0.7049924 0.5145946
## Step model                                      0.5132450   0.6641452 0.5491892
## Significant predictors model                    0.5056764   0.6611195 0.5427027
## Full model                                      0.5288553   0.6414523 0.5556757
## Full model with interaction                     0.5397351   0.6323752 0.5618018
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036
## Lasso min lambda                                0.6636708   0.5158850 0.6284685

As we can see “Lasso 1se lambda” is the best model in terms of sensitivity now. Unfortunately it is also the worst model in terms of accuracy and specificity. The model “Lasso min lambda” is the worst model in terms of sensitivity but has the higher specificity and accuracy.

Confusion Matrix for lambda.1se:

confusion_matrix(lasso.1se.pred, lasso.1se.metrics$threshold)
##        true
## preds   FALSE TRUE
##   FALSE   962  195
##   TRUE   1152  466

Confusion Matrix for lambda.min:

confusion_matrix(lasso.min.pred, lasso.min.metrics$threshold)
##        true
## preds   FALSE TRUE
##   FALSE  1403  320
##   TRUE    711  341

4.3.3 Ridge logistic regression

Now we try with ridge regression. As before we use as starting point the formula with all predictors and interaction term.

ridge.mod <- glmnet(murder_prob ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year, data=shootings.train, alpha=0, family = binomial)
plot(ridge.mod, main='Path plot of the Ridge estimates\n\n')

We choose lambda using cross validation:

ridge.cv <- cv.glmnet(murder_prob ~ . + perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex + year:day_year, data=shootings.train, alpha=0, family = binomial)
plot(ridge.cv)

Ridge (as Lasso) gives us the choice between two values of lambda:

  • lambda.min: lambda of minimum mean cross-validated error.

  • lambda.1se: largest value of lambda such that error is within 1 standard error of the cross-validated errors for lambda.min.

We explore both paths.

4.3.3.1 1se lambda

plot(ridge.mod, xvar = "lambda", main='Path plot of the Ridge estimates with 1se lambda\n\n')
abline(v =log(ridge.cv$lambda.1se), lty="dashed")

ridge.1se.coef <- coef(ridge.mod,  s=ridge.cv$lambda.1se)
ridge.1se.coef
## 96 x 1 sparse Matrix of class "dgCMatrix"
##                                                           s1
## (Intercept)                                     3.845321e+00
## day_periodEarlyMorning                          1.388384e-01
## day_periodMorning                               3.401170e-02
## day_periodEarlyAfternoon                       -1.462631e-02
## day_periodAfternoon                            -2.557651e-02
## day_periodEvening                              -8.282254e-03
## day_periodNight                                -1.052521e-02
## day_year                                        3.089940e-05
## year                                           -1.302723e-03
## week_dayMonday                                  5.852612e-03
## week_dayTuesday                                -1.953402e-02
## week_dayWednesday                               2.564763e-02
## week_dayThursday                               -4.587178e-04
## week_dayFriday                                  1.675245e-02
## week_daySaturday                               -2.601010e-02
## week_daySunday                                  4.776716e-03
## COVID_lockdownFALSE                             4.110477e-03
## COVID_lockdownTRUE                             -4.109947e-03
## COVID_pandemicFALSE                            -3.101527e-03
## COVID_pandemicTRUE                              3.101402e-03
## working_hourFALSE                               7.875071e-03
## working_hourTRUE                               -7.875044e-03
## Latitude                                        1.139847e-01
## Longitude                                       9.593571e-02
## city_locationS_Manhattan                        4.992745e-03
## city_locationN_Manhattan                       -4.756823e-02
## city_locationW_Bronx                            3.146970e-02
## city_locationE_Bronx                            1.558811e-02
## city_locationN_E_Queens                        -1.826869e-02
## city_locationN_W_Queens                         2.009564e-03
## city_locationS_Queens                           1.749562e-02
## city_locationN_Brooklyn                        -9.227367e-03
## city_locationS_E_Brooklyn                      -8.005881e-03
## city_locationS_W_Brooklyn                       9.330641e-03
## city_locationW_Staten_Island                   -1.011926e-01
## city_locationE_Staten_Island                   -8.709403e-03
## perp_age<18                                    -7.144389e-02
## perp_age18-24                                  -4.581336e-02
## perp_age25-44                                   5.406089e-02
## perp_age45+                                     1.267833e-01
## vic_age<18                                     -7.397795e-02
## vic_age18-24                                   -1.240023e-02
## vic_age25-44                                    3.036392e-02
## vic_age45+                                      3.391347e-02
## perp_sexF                                       1.601497e-02
## perp_sexM                                      -1.601363e-02
## vic_sexF                                        1.747560e-02
## vic_sexM                                       -1.747380e-02
## perp_raceASIAN/WHITE                            1.489891e-01
## perp_raceBLACK                                 -1.471085e-02
## perp_raceBLACK HISPANIC                        -3.379992e-02
## perp_raceWHITE HISPANIC                         9.781474e-03
## vic_raceASIAN/WHITE                             6.193949e-02
## vic_raceBLACK                                  -9.274449e-04
## vic_raceBLACK HISPANIC                         -5.675617e-02
## vic_raceWHITE HISPANIC                          1.757999e-02
## other_victims                                   4.397806e-02
## jurisdictionPATROL                              5.149342e-02
## jurisdictionHOUSING                            -5.149318e-02
## perp_raceASIAN/WHITE:vic_raceASIAN/WHITE        2.639927e-01
## perp_raceBLACK:vic_raceASIAN/WHITE             -4.481049e-02
## perp_raceBLACK HISPANIC:vic_raceASIAN/WHITE    -1.923715e-01
## perp_raceWHITE HISPANIC:vic_raceASIAN/WHITE     2.529882e-02
## perp_raceASIAN/WHITE:vic_raceBLACK             -6.532200e-02
## perp_raceBLACK:vic_raceBLACK                   -1.499015e-03
## perp_raceBLACK HISPANIC:vic_raceBLACK          -1.406054e-02
## perp_raceWHITE HISPANIC:vic_raceBLACK           2.125830e-02
## perp_raceASIAN/WHITE:vic_raceBLACK HISPANIC    -1.213886e-01
## perp_raceBLACK:vic_raceBLACK HISPANIC          -4.573116e-02
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC -7.383989e-02
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC -4.224034e-02
## perp_raceASIAN/WHITE:vic_raceWHITE HISPANIC     1.383977e-01
## perp_raceBLACK:vic_raceWHITE HISPANIC           8.594714e-03
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC  2.610457e-03
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC  1.822383e-02
## perp_age<18:vic_age<18                         -1.095491e-01
## perp_age18-24:vic_age<18                       -6.119831e-02
## perp_age25-44:vic_age<18                       -1.942468e-02
## perp_age45+:vic_age<18                          2.296742e-02
## perp_age<18:vic_age18-24                       -7.664121e-02
## perp_age18-24:vic_age18-24                     -2.897207e-02
## perp_age25-44:vic_age18-24                      4.631818e-02
## perp_age45+:vic_age18-24                        1.300625e-01
## perp_age<18:vic_age25-44                        8.020179e-03
## perp_age18-24:vic_age25-44                     -1.533855e-02
## perp_age25-44:vic_age25-44                      4.219886e-02
## perp_age45+:vic_age25-44                        7.499659e-02
## perp_age<18:vic_age45+                         -7.101450e-02
## perp_age18-24:vic_age45+                       -5.759527e-02
## perp_age25-44:vic_age45+                        3.684330e-02
## perp_age45+:vic_age45+                          2.020744e-01
## perp_sexF:vic_sexF                             -1.042000e-01
## perp_sexM:vic_sexF                              2.385901e-02
## perp_sexF:vic_sexM                              4.221193e-02
## perp_sexM:vic_sexM                             -2.328960e-02
## year:day_year                                   1.495699e-08

Differently from Lasso, Ridge doesn’t force to zero the coefficient thus is difficult to interpret those numbers.

4.3.3.2 min lambda

plot(ridge.mod, xvar = "lambda", main='Path plot of the Ridge estimates with min lambda\n\n')
abline(v =log(ridge.cv$lambda.min), lty="dashed")

ridge.min.coef <- coef(ridge.mod,  s=ridge.cv$lambda.min)
ridge.min.coef
## 96 x 1 sparse Matrix of class "dgCMatrix"
##                                                           s1
## (Intercept)                                    -5.901296e+00
## day_periodEarlyMorning                          3.135773e-01
## day_periodMorning                               9.849081e-02
## day_periodEarlyAfternoon                       -1.861883e-02
## day_periodAfternoon                            -5.443512e-02
## day_periodEvening                              -3.104734e-02
## day_periodNight                                -2.827166e-02
## day_year                                        5.775967e-05
## year                                           -4.731804e-03
## week_dayMonday                                  1.188938e-02
## week_dayTuesday                                -2.966191e-02
## week_dayWednesday                               8.232225e-02
## week_dayThursday                                8.206412e-04
## week_dayFriday                                  5.329583e-02
## week_daySaturday                               -8.089602e-02
## week_daySunday                                 -5.792982e-03
## COVID_lockdownFALSE                             2.526777e-02
## COVID_lockdownTRUE                             -2.492615e-02
## COVID_pandemicFALSE                            -1.661065e-02
## COVID_pandemicTRUE                              1.638272e-02
## working_hourFALSE                               1.475393e-02
## working_hourTRUE                               -1.468342e-02
## Latitude                                        3.766273e-01
## Longitude                                       1.791574e-02
## city_locationS_Manhattan                        9.093828e-04
## city_locationN_Manhattan                       -1.156430e-01
## city_locationW_Bronx                            6.121018e-02
## city_locationE_Bronx                            5.480075e-03
## city_locationN_E_Queens                        -3.173780e-02
## city_locationN_W_Queens                        -2.218201e-03
## city_locationS_Queens                           5.199380e-02
## city_locationN_Brooklyn                        -3.284680e-03
## city_locationS_E_Brooklyn                       8.959906e-05
## city_locationS_W_Brooklyn                       2.523190e-02
## city_locationW_Staten_Island                   -3.156249e-01
## city_locationE_Staten_Island                   -2.178454e-02
## perp_age<18                                    -1.294470e-01
## perp_age18-24                                  -6.792880e-02
## perp_age25-44                                   8.816658e-02
## perp_age45+                                     1.987393e-01
## vic_age<18                                     -1.414013e-01
## vic_age18-24                                   -9.364481e-03
## vic_age25-44                                    5.350341e-02
## vic_age45+                                      3.656073e-02
## perp_sexF                                       1.128299e-02
## perp_sexM                                      -1.146672e-02
## vic_sexF                                        6.355825e-03
## vic_sexM                                       -6.758703e-03
## perp_raceASIAN/WHITE                            2.288386e-01
## perp_raceBLACK                                 -1.239356e-02
## perp_raceBLACK HISPANIC                        -6.732929e-02
## perp_raceWHITE HISPANIC                         9.041020e-03
## vic_raceASIAN/WHITE                             8.034759e-02
## vic_raceBLACK                                   1.497959e-02
## vic_raceBLACK HISPANIC                         -1.031781e-01
## vic_raceWHITE HISPANIC                          1.735639e-02
## other_victims                                   1.092331e-01
## jurisdictionPATROL                              8.883963e-02
## jurisdictionHOUSING                            -8.866483e-02
## perp_raceASIAN/WHITE:vic_raceASIAN/WHITE        5.127335e-01
## perp_raceBLACK:vic_raceASIAN/WHITE             -1.356727e-01
## perp_raceBLACK HISPANIC:vic_raceASIAN/WHITE    -5.444265e-01
## perp_raceWHITE HISPANIC:vic_raceASIAN/WHITE     9.018283e-03
## perp_raceASIAN/WHITE:vic_raceBLACK             -2.496443e-01
## perp_raceBLACK:vic_raceBLACK                    1.357133e-02
## perp_raceBLACK HISPANIC:vic_raceBLACK          -1.954631e-02
## perp_raceWHITE HISPANIC:vic_raceBLACK           4.459676e-02
## perp_raceASIAN/WHITE:vic_raceBLACK HISPANIC    -3.501667e-01
## perp_raceBLACK:vic_raceBLACK HISPANIC          -6.776895e-02
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC -1.416040e-01
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC -8.677638e-02
## perp_raceASIAN/WHITE:vic_raceWHITE HISPANIC     1.152705e-01
## perp_raceBLACK:vic_raceWHITE HISPANIC           5.360135e-03
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC  9.518035e-03
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC  1.977844e-02
## perp_age<18:vic_age<18                         -2.111777e-01
## perp_age18-24:vic_age<18                       -1.019182e-01
## perp_age25-44:vic_age<18                       -7.585244e-02
## perp_age45+:vic_age<18                          5.114552e-02
## perp_age<18:vic_age18-24                       -1.618143e-01
## perp_age18-24:vic_age18-24                     -2.861700e-02
## perp_age25-44:vic_age18-24                      8.467749e-02
## perp_age45+:vic_age18-24                        2.576952e-01
## perp_age<18:vic_age25-44                        7.116809e-02
## perp_age18-24:vic_age25-44                     -2.026125e-02
## perp_age25-44:vic_age25-44                      6.836001e-02
## perp_age45+:vic_age25-44                        6.925437e-02
## perp_age<18:vic_age45+                         -1.466736e-01
## perp_age18-24:vic_age45+                       -1.658063e-01
## perp_age25-44:vic_age45+                        5.742714e-02
## perp_age45+:vic_age45+                          3.677524e-01
## perp_sexF:vic_sexF                             -2.954253e-01
## perp_sexM:vic_sexF                              2.239350e-02
## perp_sexF:vic_sexM                              7.833969e-02
## perp_sexM:vic_sexM                             -2.075992e-02
## year:day_year                                   2.648577e-08

4.3.3.3 Model comparison

ridge.1se.pred <- predict(ridge.mod, s=ridge.cv$lambda.1se, newdata = shootings.test, type = "response")
ridge.min.pred <- predict(ridge.mod, s=ridge.cv$lambda.min, newdata = shootings.test, type = "response")
par(mfrow=c(1,2))

ridge.1se.roc <- roc(shootings.test$murder_prob ~ ridge.1se.pred, plot=TRUE, print.auc=TRUE, main="1se lambda Ridge ROC curve")
ridge.min.roc <- roc(shootings.test$murder_prob ~ ridge.min.pred, plot=TRUE, print.auc=TRUE, main="min lambda Ridge ROC curve")

The ROC curves are almost identical.

ridge.1se.metrics <- coords(ridge.1se.roc, x="best", ret="all")
row.names(ridge.1se.metrics) <- "ridge 1se lambda"

ridge.min.metrics <- coords(ridge.min.roc, x="best", ret="all")
row.names(ridge.min.metrics) <- "ridge min lambda"
metrics <- rbind(metrics, ridge.1se.metrics, ridge.min.metrics)
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## Lasso 1se lambda                                0.4550615   0.7049924 0.5145946
## Step model                                      0.5132450   0.6641452 0.5491892
## Significant predictors model                    0.5056764   0.6611195 0.5427027
## Full model                                      0.5288553   0.6414523 0.5556757
## Full model with interaction                     0.5397351   0.6323752 0.5618018
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036
## ridge 1se lambda                                0.5501419   0.6157337 0.5657658
## Lasso min lambda                                0.6636708   0.5158850 0.6284685
## ridge min lambda                                0.7270577   0.4372163 0.6580180

“Ridge min lambda” model is the worse in terms of sensitivity; while “Ridge 1se lambda” is below average in terms of sensitivity.

Confusion Matrix for lambda.1se:

confusion_matrix(ridge.1se.pred, ridge.1se.metrics$threshold)
##        true
## preds   FALSE TRUE
##   FALSE  1163  254
##   TRUE    951  407

Confusion Matrix for lambda.min:

confusion_matrix(ridge.min.pred, ridge.min.metrics$threshold)
##        true
## preds   FALSE TRUE
##   FALSE  1537  372
##   TRUE    577  289

5 Naive Bayes

nb.fit <- naiveBayes(formula(glm.full), data = shootings.train)
nb.fit
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.7611281 0.2388719 
## 
## Conditional probabilities:
##    day_period
## Y   EarlyMorning    Morning EarlyAfternoon  Afternoon    Evening      Night
##   0   0.04901148 0.06037647     0.10465254 0.20883154 0.20670060 0.37042737
##   1   0.07808374 0.06752169     0.09807620 0.19011694 0.20294229 0.36325915
## 
##    day_year
## Y       [,1]     [,2]
##   0 187.0831 96.53237
##   1 188.4889 98.96107
## 
##    year
## Y       [,1]     [,2]
##   0 2013.540 5.193262
##   1 2013.513 5.337454
## 
##    week_day
## Y      Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday
##   0 0.1369717 0.1226471 0.1156624 0.1186220 0.1340121 0.1875222 0.1845626
##   1 0.1395700 0.1124104 0.1233497 0.1184459 0.1395700 0.1772916 0.1893625
## 
##    COVID_lockdown
## Y        FALSE       TRUE
##   0 0.96247188 0.03752812
##   1 0.96114674 0.03885326
## 
##    COVID_pandemic
## Y       FALSE      TRUE
##   0 0.8024151 0.1975849
##   1 0.7963033 0.2036967
## 
##    working_hour
## Y       FALSE      TRUE
##   0 0.8103469 0.1896531
##   1 0.8193135 0.1806865
## 
##    Latitude
## Y       [,1]       [,2]
##   0 40.74204 0.08895157
##   1 40.74534 0.08970732
## 
##    Longitude
## Y        [,1]       [,2]
##   0 -73.91299 0.06986026
##   1 -73.91067 0.07050189
## 
##    perp_age
## Y          <18      18-24      25-44        45+
##   0 0.12241032 0.45720374 0.38179235 0.03859358
##   1 0.08675971 0.38966428 0.45492267 0.06865334
## 
##    vic_age
## Y          <18      18-24      25-44        45+
##   0 0.11956908 0.35622114 0.44512845 0.07908133
##   1 0.08411920 0.33006413 0.48811769 0.09769898
## 
##    perp_sex
## Y            F          M
##   0 0.02935954 0.97064046
##   1 0.03432667 0.96567333
## 
##    vic_sex
## Y           F         M
##   0 0.1124660 0.8875340
##   1 0.1316484 0.8683516
## 
##    perp_race
## Y   ASIAN/WHITE      BLACK BLACK HISPANIC WHITE HISPANIC
##   0  0.02391382 0.73339647     0.08985439     0.15283533
##   1  0.04903810 0.70803470     0.07883817     0.16408902
## 
##    vic_race
## Y   ASIAN/WHITE      BLACK BLACK HISPANIC WHITE HISPANIC
##   0  0.04545993 0.67704510     0.10915118     0.16834379
##   1  0.06601283 0.66088269     0.08600528     0.18709921
## 
##    other_victims
## Y       [,1]     [,2]
##   0 1.048656 1.904262
##   1 1.719351 2.632683
## 
##    jurisdiction
## Y      PATROL   HOUSING
##   0 0.8318930 0.1681070
##   1 0.8675971 0.1324029
nb.pred <- predict(nb.fit, newdata = shootings.test, type = "raw")
nb.roc <- roc(shootings.test$murder_prob ~ nb.pred[,2], plot=TRUE, print.auc=TRUE, main="Naive Bayes ROC curve")

nb.metrics <- coords(nb.roc, x="best", ret="all")
row.names(nb.metrics) <- "Naive bayes"
metrics <- rbind(metrics, nb.metrics)
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## Lasso 1se lambda                                0.4550615   0.7049924 0.5145946
## Step model                                      0.5132450   0.6641452 0.5491892
## Significant predictors model                    0.5056764   0.6611195 0.5427027
## Naive bayes                                     0.5444655   0.6444781 0.5682883
## Full model                                      0.5288553   0.6414523 0.5556757
## Full model with interaction                     0.5397351   0.6323752 0.5618018
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036
## ridge 1se lambda                                0.5501419   0.6157337 0.5657658
## Lasso min lambda                                0.6636708   0.5158850 0.6284685
## ridge min lambda                                0.7270577   0.4372163 0.6580180

Naive Bayes has high sensitivity but not the best one and also has a pretty okey specificity.

Confusion matrix for Naive Bayes:

confusion_matrix(nb.pred[,2], nb.metrics$threshold)
##        true
## preds   FALSE TRUE
##   FALSE  1151  235
##   TRUE    963  426

6 GAM

Now let’s fit a GAM model with all predictors (interaction term included). I applied splines to all the numerical predictors.

f <- update(formula(glm.full.inter), . ~ . + perp_age:vic_age + perp_sex:vic_sex -year -day_year -Latitude -Longitude + s(year) + s(day_year, bs="cc") + s(Latitude) + s(Longitude) -other_victims + s(other_victims))
gam.fit <- gam(f, data=shootings.train, family=binomial)
 
summary(gam.fit)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## murder_prob ~ day_period + week_day + COVID_lockdown + COVID_pandemic + 
##     working_hour + perp_age + vic_age + perp_sex + vic_sex + 
##     perp_race + vic_race + jurisdiction + s(year) + s(day_year, 
##     bs = "cc") + s(Latitude) + s(Longitude) + s(other_victims) + 
##     perp_race:vic_race + perp_age:vic_age + perp_sex:vic_sex
## 
## Parametric coefficients:
##                                                 Estimate Std. Error z value
## (Intercept)                                    -0.728455   0.410341  -1.775
## day_periodMorning                              -0.226595   0.140650  -1.611
## day_periodEarlyAfternoon                       -0.350309   0.130967  -2.675
## day_periodAfternoon                            -0.422978   0.107230  -3.945
## day_periodEvening                              -0.384590   0.104242  -3.689
## day_periodNight                                -0.370610   0.097612  -3.797
## week_dayTuesday                                -0.045968   0.091956  -0.500
## week_dayWednesday                               0.116609   0.090848   1.284
## week_dayThursday                                0.006966   0.091375   0.076
## week_dayFriday                                  0.062237   0.088047   0.707
## week_daySaturday                               -0.112104   0.086299  -1.299
## week_daySunday                                 -0.015481   0.085427  -0.181
## COVID_lockdownTRUE                              0.056426   0.175569   0.321
## COVID_pandemicTRUE                             -0.306546   0.236900  -1.294
## working_hourTRUE                               -0.069236   0.091084  -0.760
## perp_age18-24                                   0.224478   0.179585   1.250
## perp_age25-44                                   0.395946   0.223136   1.774
## perp_age45+                                     0.692933   0.613908   1.129
## vic_age18-24                                    0.213367   0.188010   1.135
## vic_age25-44                                    0.628756   0.199414   3.153
## vic_age45+                                      0.274866   0.358418   0.767
## perp_sexM                                       0.382277   0.337087   1.134
## vic_sexM                                        0.564608   0.358106   1.577
## perp_raceBLACK                                 -1.066301   0.221670  -4.810
## perp_raceBLACK HISPANIC                        -1.668204   0.511983  -3.258
## perp_raceWHITE HISPANIC                        -0.882037   0.276411  -3.191
## vic_raceBLACK                                  -0.984463   0.330294  -2.981
## vic_raceBLACK HISPANIC                         -1.295326   0.529115  -2.448
## vic_raceWHITE HISPANIC                         -0.624343   0.325836  -1.916
## jurisdictionHOUSING                            -0.199834   0.067017  -2.982
## perp_raceBLACK:vic_raceBLACK                    1.103346   0.367819   3.000
## perp_raceBLACK HISPANIC:vic_raceBLACK           1.592625   0.601762   2.647
## perp_raceWHITE HISPANIC:vic_raceBLACK           1.022564   0.413815   2.471
## perp_raceBLACK:vic_raceBLACK HISPANIC           1.233328   0.562835   2.191
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC  1.657084   0.738430   2.244
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC  1.059277   0.594133   1.783
## perp_raceBLACK:vic_raceWHITE HISPANIC           0.759589   0.370246   2.052
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC  1.261960   0.603553   2.091
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC  0.569182   0.406294   1.401
## perp_age18-24:vic_age18-24                      0.020166   0.222845   0.090
## perp_age25-44:vic_age18-24                      0.116118   0.262423   0.442
## perp_age45+:vic_age18-24                        0.192137   0.690454   0.278
## perp_age18-24:vic_age25-44                     -0.327887   0.233299  -1.405
## perp_age25-44:vic_age25-44                     -0.256357   0.266305  -0.963
## perp_age45+:vic_age25-44                       -0.467323   0.644026  -0.726
## perp_age18-24:vic_age45+                       -0.124813   0.403515  -0.309
## perp_age25-44:vic_age45+                        0.094252   0.413197   0.228
## perp_age45+:vic_age45+                          0.202144   0.715398   0.283
## perp_sexM:vic_sexM                             -0.596309   0.365515  -1.631
##                                                Pr(>|z|)    
## (Intercept)                                    0.075857 .  
## day_periodMorning                              0.107167    
## day_periodEarlyAfternoon                       0.007478 ** 
## day_periodAfternoon                            7.99e-05 ***
## day_periodEvening                              0.000225 ***
## day_periodNight                                0.000147 ***
## week_dayTuesday                                0.617151    
## week_dayWednesday                              0.199293    
## week_dayThursday                               0.939232    
## week_dayFriday                                 0.479655    
## week_daySaturday                               0.193939    
## week_daySunday                                 0.856197    
## COVID_lockdownTRUE                             0.747917    
## COVID_pandemicTRUE                             0.195669    
## working_hourTRUE                               0.447177    
## perp_age18-24                                  0.211304    
## perp_age25-44                                  0.075987 .  
## perp_age45+                                    0.259014    
## vic_age18-24                                   0.256430    
## vic_age25-44                                   0.001616 ** 
## vic_age45+                                     0.443149    
## perp_sexM                                      0.256769    
## vic_sexM                                       0.114876    
## perp_raceBLACK                                 1.51e-06 ***
## perp_raceBLACK HISPANIC                        0.001121 ** 
## perp_raceWHITE HISPANIC                        0.001418 ** 
## vic_raceBLACK                                  0.002877 ** 
## vic_raceBLACK HISPANIC                         0.014361 *  
## vic_raceWHITE HISPANIC                         0.055349 .  
## jurisdictionHOUSING                            0.002865 ** 
## perp_raceBLACK:vic_raceBLACK                   0.002702 ** 
## perp_raceBLACK HISPANIC:vic_raceBLACK          0.008130 ** 
## perp_raceWHITE HISPANIC:vic_raceBLACK          0.013471 *  
## perp_raceBLACK:vic_raceBLACK HISPANIC          0.028432 *  
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.024828 *  
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.074604 .  
## perp_raceBLACK:vic_raceWHITE HISPANIC          0.040210 *  
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.036538 *  
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.161240    
## perp_age18-24:vic_age18-24                     0.927897    
## perp_age25-44:vic_age18-24                     0.658138    
## perp_age45+:vic_age18-24                       0.780800    
## perp_age18-24:vic_age25-44                     0.159892    
## perp_age25-44:vic_age25-44                     0.335726    
## perp_age45+:vic_age25-44                       0.468067    
## perp_age18-24:vic_age45+                       0.757083    
## perp_age25-44:vic_age45+                       0.819565    
## perp_age45+:vic_age45+                         0.777513    
## perp_sexM:vic_sexM                             0.102801    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df  Chi.sq p-value    
## s(year)          7.403  8.348  22.560 0.00509 ** 
## s(day_year)      6.599  8.000  16.504 0.01192 *  
## s(Latitude)      1.001  1.003   2.775 0.09614 .  
## s(Longitude)     1.000  1.001   1.318 0.25105    
## s(other_victims) 8.837  8.978 290.618 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0462   Deviance explained = 4.69%
## UBRE = 0.061295  Scale est. = 1         n = 11098

Plot of the splines effect:

par(mfrow=c(2,3))

for (i in 1:5){
  plot(gam.fit, select=i, shade=TRUE, shade.col = "lightblue")
  abline(h=0, lty="dashed")
}

As we can see both for the model summary and the plots the effect of Latitude and Longitude seams linear; year and day_year seams periodic and other_victims is highly non linear.

Now let’s try fitting a GAM model now with only significant predictors:

gam2.fit <- update(gam.fit, . ~ . - s(Latitude) -s(Longitude) - week_day -COVID_lockdown -COVID_pandemic -working_hour -vic_sex -perp_sex - perp_age:vic_age - perp_sex:vic_sex)
 
summary(gam2.fit)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## murder_prob ~ day_period + perp_age + vic_age + perp_race + vic_race + 
##     jurisdiction + s(year) + s(day_year, bs = "cc") + s(other_victims) + 
##     perp_race:vic_race
## 
## Parametric coefficients:
##                                                Estimate Std. Error z value
## (Intercept)                                    -0.41239    0.20423  -2.019
## day_periodMorning                              -0.26084    0.12591  -2.072
## day_periodEarlyAfternoon                       -0.39040    0.11492  -3.397
## day_periodAfternoon                            -0.42978    0.10344  -4.155
## day_periodEvening                              -0.36603    0.10273  -3.563
## day_periodNight                                -0.37110    0.09711  -3.821
## perp_age18-24                                   0.12683    0.08403   1.509
## perp_age25-44                                   0.38626    0.08598   4.492
## perp_age45+                                     0.62042    0.12863   4.823
## vic_age18-24                                    0.26904    0.08705   3.091
## vic_age25-44                                    0.33974    0.08701   3.905
## vic_age45+                                      0.32995    0.11303   2.919
## perp_raceBLACK                                 -1.04910    0.22011  -4.766
## perp_raceBLACK HISPANIC                        -1.62322    0.51076  -3.178
## perp_raceWHITE HISPANIC                        -0.86320    0.27436  -3.146
## vic_raceBLACK                                  -0.95882    0.32881  -2.916
## vic_raceBLACK HISPANIC                         -1.29250    0.52898  -2.443
## vic_raceWHITE HISPANIC                         -0.58978    0.32290  -1.827
## jurisdictionHOUSING                            -0.20106    0.06654  -3.022
## perp_raceBLACK:vic_raceBLACK                    1.07409    0.36637   2.932
## perp_raceBLACK HISPANIC:vic_raceBLACK           1.55815    0.60051   2.595
## perp_raceWHITE HISPANIC:vic_raceBLACK           1.02387    0.41187   2.486
## perp_raceBLACK:vic_raceBLACK HISPANIC           1.24700    0.56264   2.216
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC  1.67212    0.73796   2.266
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC  1.06836    0.59351   1.800
## perp_raceBLACK:vic_raceWHITE HISPANIC           0.74504    0.36809   2.024
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC  1.23950    0.60167   2.060
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC  0.55419    0.40375   1.373
##                                                Pr(>|z|)    
## (Intercept)                                    0.043456 *  
## day_periodMorning                              0.038298 *  
## day_periodEarlyAfternoon                       0.000681 ***
## day_periodAfternoon                            3.25e-05 ***
## day_periodEvening                              0.000367 ***
## day_periodNight                                0.000133 ***
## perp_age18-24                                  0.131212    
## perp_age25-44                                  7.04e-06 ***
## perp_age45+                                    1.41e-06 ***
## vic_age18-24                                   0.001997 ** 
## vic_age25-44                                   9.44e-05 ***
## vic_age45+                                     0.003512 ** 
## perp_raceBLACK                                 1.88e-06 ***
## perp_raceBLACK HISPANIC                        0.001483 ** 
## perp_raceWHITE HISPANIC                        0.001654 ** 
## vic_raceBLACK                                  0.003545 ** 
## vic_raceBLACK HISPANIC                         0.014550 *  
## vic_raceWHITE HISPANIC                         0.067769 .  
## jurisdictionHOUSING                            0.002513 ** 
## perp_raceBLACK:vic_raceBLACK                   0.003371 ** 
## perp_raceBLACK HISPANIC:vic_raceBLACK          0.009467 ** 
## perp_raceWHITE HISPANIC:vic_raceBLACK          0.012923 *  
## perp_raceBLACK:vic_raceBLACK HISPANIC          0.026670 *  
## perp_raceBLACK HISPANIC:vic_raceBLACK HISPANIC 0.023459 *  
## perp_raceWHITE HISPANIC:vic_raceBLACK HISPANIC 0.071850 .  
## perp_raceBLACK:vic_raceWHITE HISPANIC          0.042963 *  
## perp_raceBLACK HISPANIC:vic_raceWHITE HISPANIC 0.039390 *  
## perp_raceWHITE HISPANIC:vic_raceWHITE HISPANIC 0.169877    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df Chi.sq p-value    
## s(year)          6.943  8.011   19.0  0.0150 *  
## s(day_year)      6.630  8.000   17.6  0.0078 ** 
## s(other_victims) 8.828  8.975  293.0  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0462   Deviance explained = 4.46%
## UBRE = 0.059628  Scale est. = 1         n = 11098
par(mfrow=c(1,3))

for (i in 1:5){
  plot(gam2.fit, select=i, shade=TRUE, shade.col = "lightblue")
  abline(h=0, lty="dashed")
}

The splines effects doesn’t change much compared to the previews model.

6.1 Model comparison

gam.predict <- predict(gam.fit, newdata = shootings.test, type = "response")
gam2.predict <- predict(gam2.fit, newdata = shootings.test, type = "response")
par(mfrow=c(1,2))

gam.roc <- roc(shootings.test$murder_prob ~ gam.predict, plot=TRUE, print.auc=TRUE, main="GAM 1 ROC curve")
gam2.roc <- roc(shootings.test$murder_prob ~ gam2.predict, plot=TRUE, print.auc=TRUE, main="GAM 2 ROC curve")

The ROC curves are almost identical.

gam.metrics <- coords(gam.roc, x="best", ret="all")
gam2.metrics <- coords(gam2.roc, x="best", ret="all")

row.names(gam.metrics) <- "GAM1"
row.names(gam2.metrics) <- "GAM2"
metrics <- rbind(metrics, gam.metrics, gam2.metrics)
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## Lasso 1se lambda                                0.4550615   0.7049924 0.5145946
## Step model                                      0.5132450   0.6641452 0.5491892
## Significant predictors model                    0.5056764   0.6611195 0.5427027
## Naive bayes                                     0.5444655   0.6444781 0.5682883
## Full model                                      0.5288553   0.6414523 0.5556757
## GAM2                                            0.5742668   0.6384266 0.5895495
## Full model with interaction                     0.5397351   0.6323752 0.5618018
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036
## ridge 1se lambda                                0.5501419   0.6157337 0.5657658
## GAM1                                            0.5979186   0.6051437 0.5996396
## Lasso min lambda                                0.6636708   0.5158850 0.6284685
## ridge min lambda                                0.7270577   0.4372163 0.6580180

“GAM1” model is the worst in terms of sensitivity but gives the best results in accuracy and specificity; while “GAM 2” is average in terms of sensitivity but gives the second best accuracy.

7 Conclusions

Let’s find the best model for all the metrics:

  1. specificity
(metrics %>% arrange(desc(specificity)))[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## ridge min lambda                                0.7270577   0.4372163 0.6580180
## Lasso min lambda                                0.6636708   0.5158850 0.6284685
## GAM1                                            0.5979186   0.6051437 0.5996396
## GAM2                                            0.5742668   0.6384266 0.5895495
## ridge 1se lambda                                0.5501419   0.6157337 0.5657658
## Naive bayes                                     0.5444655   0.6444781 0.5682883
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036
## Full model with interaction                     0.5397351   0.6323752 0.5618018
## Full model                                      0.5288553   0.6414523 0.5556757
## Step model                                      0.5132450   0.6641452 0.5491892
## Significant predictors model                    0.5056764   0.6611195 0.5427027
## Lasso 1se lambda                                0.4550615   0.7049924 0.5145946

The best model in terms of specificity is “Lasso min lambda”.

  1. sensitivity
(metrics %>% arrange(desc(sensitivity)))[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## Lasso 1se lambda                                0.4550615   0.7049924 0.5145946
## Step model                                      0.5132450   0.6641452 0.5491892
## Significant predictors model                    0.5056764   0.6611195 0.5427027
## Naive bayes                                     0.5444655   0.6444781 0.5682883
## Full model                                      0.5288553   0.6414523 0.5556757
## GAM2                                            0.5742668   0.6384266 0.5895495
## Full model with interaction                     0.5397351   0.6323752 0.5618018
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036
## ridge 1se lambda                                0.5501419   0.6157337 0.5657658
## GAM1                                            0.5979186   0.6051437 0.5996396
## Lasso min lambda                                0.6636708   0.5158850 0.6284685
## ridge min lambda                                0.7270577   0.4372163 0.6580180

The best model in terms of sensitivity is “Lasso 1se lambda”. Unfortunately it has a not acceptable specificity; thus we still prefer the “Step model”.

  1. accuracy
(metrics %>% arrange(desc(accuracy)))[, c("specificity", "sensitivity", "accuracy")]
##                                               specificity sensitivity  accuracy
## ridge min lambda                                0.7270577   0.4372163 0.6580180
## Lasso min lambda                                0.6636708   0.5158850 0.6284685
## GAM1                                            0.5979186   0.6051437 0.5996396
## GAM2                                            0.5742668   0.6384266 0.5895495
## Naive bayes                                     0.5444655   0.6444781 0.5682883
## ridge 1se lambda                                0.5501419   0.6157337 0.5657658
## Significant predictors model with interaction   0.5421003   0.6323752 0.5636036
## Full model with interaction                     0.5397351   0.6323752 0.5618018
## Full model                                      0.5288553   0.6414523 0.5556757
## Step model                                      0.5132450   0.6641452 0.5491892
## Significant predictors model                    0.5056764   0.6611195 0.5427027
## Lasso 1se lambda                                0.4550615   0.7049924 0.5145946

The best model in terms of accuracy is “Lasso min lambda”.

Since the response in unbalanced:

print(dfSummary(shootings.test[,"murder_prob"]), method="render")
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 murder_prob [numeric]
Min : 0
Mean : 0.2
Max : 1
0:2114(76.2%)
1:661(23.8%)
2775 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-19

The best model in terms of prediction power should maximize its ability to correctly classify murders. For this reason i should choose the model with higher sensitivity and acceptable specificity (>0.5): “Step model”:

summary(glm.step)
## 
## Call:
## glm(formula = murder_prob ~ day_period + year + Latitude + perp_age + 
##     vic_age + perp_race + vic_race + other_victims + jurisdiction, 
##     family = binomial, data = shootings.train)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.787131  13.616581  -0.719 0.472285    
## day_periodMorning        -0.246401   0.123987  -1.987 0.046887 *  
## day_periodEarlyAfternoon -0.397900   0.113502  -3.506 0.000455 ***
## day_periodAfternoon      -0.436626   0.101885  -4.285 1.82e-05 ***
## day_periodEvening        -0.394104   0.101080  -3.899 9.66e-05 ***
## day_periodNight          -0.408608   0.095415  -4.282 1.85e-05 ***
## year                     -0.006386   0.004422  -1.444 0.148655    
## Latitude                  0.534628   0.267956   1.995 0.046020 *  
## perp_age18-24             0.132106   0.083252   1.587 0.112554    
## perp_age25-44             0.410781   0.085078   4.828 1.38e-06 ***
## perp_age45+               0.685533   0.126370   5.425 5.80e-08 ***
## vic_age18-24              0.274380   0.086357   3.177 0.001487 ** 
## vic_age25-44              0.339776   0.086189   3.942 8.07e-05 ***
## vic_age45+                0.307450   0.111640   2.754 0.005888 ** 
## perp_raceBLACK           -0.531496   0.132681  -4.006 6.18e-05 ***
## perp_raceBLACK HISPANIC  -0.640503   0.152269  -4.206 2.59e-05 ***
## perp_raceWHITE HISPANIC  -0.487890   0.140572  -3.471 0.000519 ***
## vic_raceBLACK            -0.074958   0.109876  -0.682 0.495111    
## vic_raceBLACK HISPANIC   -0.339954   0.129735  -2.620 0.008783 ** 
## vic_raceWHITE HISPANIC   -0.096896   0.118128  -0.820 0.412069    
## other_victims             0.134627   0.009818  13.712  < 2e-16 ***
## jurisdictionHOUSING      -0.205217   0.065771  -3.120 0.001808 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12203  on 11097  degrees of freedom
## Residual deviance: 11817  on 11076  degrees of freedom
## AIC: 11861
## 
## Number of Fisher Scoring iterations: 4
metrics["Step model", c("specificity", "sensitivity", "accuracy")]
##            specificity sensitivity  accuracy
## Step model    0.513245   0.6641452 0.5491892

Confusion matrix:

confusion_matrix(glm.step.pred, glm.step.metrics$threshold)
##        true
## preds   FALSE TRUE
##   FALSE  1085  222
##   TRUE   1029  439

7.1 Initial questions

Now let’s answer the project proposal questions using the best overall model effects.

Let’s plot the effects for all the predictors:

a <- plot( effect("day_period", glm.step),rescale.axis=FALSE, ylab="Probability of murder")
b <- plot( effect("year", glm.step),rescale.axis=FALSE, ylab="Probability of murder")
c <- plot( effect("Latitude", glm.step),rescale.axis=FALSE, ylab="Probability of murder")
d <- plot( effect("perp_age", glm.step), rescale.axis=FALSE, ylab="Probability of murder")
e <-plot( effect("vic_age", glm.step), rescale.axis=FALSE, ylab="Probability of murder")
f <- plot( effect("perp_race", glm.step), rescale.axis=FALSE, ylab="Probability of murder")
g <-plot( effect("vic_race", glm.step), rescale.axis=FALSE, ylab="Probability of murder")
h <-plot( effect("other_victims", glm.step), rescale.axis=FALSE, ylab="Probability of murder")
i <-plot( effect("jurisdiction", glm.step), rescale.axis=FALSE, ylab="Probability of murder")

grid.arrange(a,b,c,d,e,f,g,h,i, top=textGrob("Predictors effect plots",gp=gpar(fontsize=20,font=3)))

The initial question were:

Question (1): is survival more likely in a specific neighborhood?

The model doesn’t give information about neighborhood as the associated predictor was removed after the multicollinearity analysis. Thus the probability of murder across neighborhood is the same according to this model.

Question (2): is it more likely to survive based on gender/age of the victim? (obviously younger victims should be more likely to survive.)

The model doesn’t give information about victim sex as the associated predictor was removed by step function. Thus the probability of murder is the same for both male and female victims according to this model. Instead we can say something about the age of the victim: clearly there is substantial increase in murder probability when we increase the victim age from <18 to 18-24 years old but from 18-24 to 25-44 and from 25-44 to 45+ there is not a substantial increase. Also the confidence band tells us that the probability of murders from 18 to 45+ years is more or less the same.

Question (3): is it more likely to survive based on the victim’s ethnicity?

The model tells us that it is more likely for a Black Hispanic to survive compared to other races but with a very small effect. Also, watching the confidence bands the probability of murder if the victims are Asian or White or Black or White Hispanic is more or less the same.

Question (4): is it more likely to survive if the shooter belongs to a different/same ethnicity than the victim?

The model doesn’t give information about interaction between ethnicity as the associated predictor was removed by step function. Thus the probability of murder doesn’t change if the shooter belongs to a different/same ethnicity than the victim according to this model.

Question (5): is it more likely to survive based on sex/age/ethnicity of the perpetrator?

The model doesn’t give information about perpetrator sex as the associated predictor was removed by step function. Thus the probability of murder is the same as the perpetrator sex changes. Instead we can say something about perpetrator age and ethnicity:

  • as the perpetrator is older the more likely is for him/her to kill the victim.

  • if the perpetrator is Asian or White it is more likely for him/her to kill the victim. Instead if the perpetrator is Black or Hispanic the murder probability doesn’t change much by watching the confidence bands.

Question (6): Is there a trend with respect to the date on which the incident occurred?

The model only give information about the year as the other date predictors were removed by step function. Unfortunately the model summary and the confidence bands of year effect tells us that this predictor is not significant.

Question (7): Is survival less likely in late hours?

The model tells us that during the early morning and the morning it is less likely to survive a shooting incident. However the confidence band are very large thus the probability of murder could not change much.

Question (8): Is it more likely to survive on a weekday?

The model doesn’t give information about the weekday as the associated predictor was removed by step function. Thus the probability of murder is the same across weekdays according to this model.

Question (9): What happened during the pandemic period?

The model doesn’t give information about the pandemic period as the associated predictors were removed by step function. Thus the probability of murder is the same even during COVID according to this model.

Furthermore the models gives us addiction information:

  • if in the shooting incident other victims are present then the probability of murder greatly increases.

  • if the jurisdiction is patrol the probability of murder increases.

  • if the Latitude increases the probability of murder increases. The model summary tells us that this predictor is slightly significant.